When axe-core Isn't Enough: Auditing My Own Portfolio with V0.2 Public

I’m spending three days auditing my own portfolio with my own toolkit, live, in public. Today is day one, and I’m running the boring tier.

The target is portfolio.sdet.it, an Astro 5 dual-domain site I’m shipping for real. Not a fixture, not a contrived demo. Real content, real components, real hosting on a VPS I configured a week ago. Three days of audits, three tool versions, same target.

Day 1 (today) is V0.2 of my public WCAG toolkit. Static TypeScript analyzer plus Playwright with axe-core. Deterministic, CI-friendly, no LLM in the loop. The kind of tier you trust to fail your build at 2am.

Day 2 will run V0.3 public, the same toolkit plus 5 AI specialists reading source through Read/Grep/Glob. Day 3 will run V0.4 Pro, the commercial tier. Each day adds capability. Each day surfaces findings the previous tier missed.

Today’s tier is what runs in CI. It catches what’s there in HTML and CSS. It misses what’s intended in source. Here’s what that looks like on a real project.

A quick word on how I got here

Backstory before we look at numbers: this isn’t where I planned to be.

Day 1 of the WCAG sprint, four hours in, I had a working TypeScript analyzer that pattern-matched HTML for missing alt attributes, missing lang, missing landmarks. AI agents wrapped around it, translating axe-style findings into prettier output. Same regex matching at the core, fancier syntax on top. Another axe-core wrapper.

I deleted it.

A regex doesn’t know that onClick={handler} without onKeyDown={handler} breaks keyboard users. AI reading the JSX does. That was the pivot. AI specialists read source through Read/Grep/Glob and write findings in their own words. Static rules become the deterministic fallback for CI, not the main flow. Two layers, different jobs.

That’s why V0.2 looks the way it does. It’s not “the whole tool.” It’s the deterministic floor of a tool whose discovery layer is AI. CI-friendly, fast, predictable. You run it on every commit and it tells you what’s rendered wrong. It does not tell you what’s structurally wrong in source.

The full pivot story is its own post. For now: V0.2 is what you keep when you decide AI is the discovery layer, not the wrapper.

V0.2 architecture

V0.2 public has two paths. Both deterministic. Both fast. Neither AI.

The static TypeScript analyzer pattern-matches HTML and CSS source files. Missing img alt attributes, missing html lang, missing landmark roles, redundant ARIA roles. Things you can find with a regex over source text. Zero LLM calls, zero tokens, sub-second on a portfolio-sized repo. CI-friendly by design.
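To make “a regex over source text” concrete, here is a minimal sketch of what two such rules could look like. This is not the toolkit’s actual code: the `StaticFinding` shape, the function names, and the `img-alt` rule id are all assumptions made for illustration.

```typescript
// Hypothetical sketch of regex-based static rules; not the toolkit's real implementation.
interface StaticFinding {
  ruleId: string;
  file: string;
  line: number;
  snippet: string;
}

// Flags <img> tags with no alt attribute at all (alt="" is allowed for decorative images).
function findImagesWithoutAlt(file: string, source: string): StaticFinding[] {
  const findings: StaticFinding[] = [];
  const imgTag = /<img\b[^>]*>/gi;
  let match: RegExpExecArray | null;
  while ((match = imgTag.exec(source)) !== null) {
    const tag = match[0];
    if (!/\salt\s*=/i.test(tag)) {
      // Line number = count of newlines before the match, plus one.
      const line = source.slice(0, match.index).split("\n").length;
      findings.push({ ruleId: "img-alt", file, line, snippet: tag });
    }
  }
  return findings;
}

// True when the <html> tag carries a non-empty lang attribute.
function hasHtmlLang(source: string): boolean {
  return /<html\b[^>]*\slang\s*=\s*["'][^"']+["']/i.test(source);
}

const sample = [
  "<html>",
  "  <body>",
  '    <img src="a.png" alt="Diagram">',
  '    <img src="b.png">',
  "  </body>",
  "</html>",
].join("\n");

const imgFindings = findImagesWithoutAlt("index.html", sample);
console.log(imgFindings.length, hasHtmlLang(sample)); // 1 false
```

The limits are visible in the sketch itself: the pattern only sees literal text, so a `>` inside a template expression or an attribute supplied through a variable is invisible to it.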

The dynamic path runs Playwright against the dev server, then runs axe-core inside a real browser. Computed contrast ratios, focus indicators, keyboard-reachable controls, ARIA in the rendered DOM. Catches what’s actually shipped to users, not just what’s in source. Slower than static (about 2 seconds for a 4-route audit), but it sees what regex can’t compute.
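As a rough illustration of what happens after axe-core returns, the sketch below flattens violations into one finding per affected node. The `AxeViolationLike` types are local stand-ins that mirror only a small subset of axe-core’s result shape, and `DynamicFinding`, `flattenViolations`, and the sample data are invented for this example, not taken from the toolkit.

```typescript
// Local types mirroring a small subset of axe-core's result shape.
type Impact = "minor" | "moderate" | "serious" | "critical";

interface AxeNodeLike {
  target: string[]; // selector path; simplified, real targets can nest for shadow DOM
}

interface AxeViolationLike {
  id: string;
  impact: Impact | null;
  tags: string[];
  nodes: AxeNodeLike[];
}

// Hypothetical flattened finding; field names are assumptions for this sketch.
interface DynamicFinding {
  ruleId: string;
  location: string; // "url:selector"
  severity: Impact;
  wcagTags: string[];
}

function flattenViolations(url: string, violations: AxeViolationLike[]): DynamicFinding[] {
  const findings: DynamicFinding[] = [];
  for (const violation of violations) {
    // Keep only success-criterion tags such as "wcag143" (SC 1.4.3).
    const wcagTags = violation.tags.filter((tag) => /^wcag\d+$/.test(tag));
    for (const node of violation.nodes) {
      findings.push({
        ruleId: violation.id,
        location: `${url}:${node.target.join(" ")}`,
        // Defaulting a missing impact to "moderate" is an arbitrary choice here.
        severity: violation.impact || "moderate",
        wcagTags,
      });
    }
  }
  return findings;
}

// Invented sample data: one rule violated by two different elements.
const sampleViolations: AxeViolationLike[] = [
  {
    id: "color-contrast",
    impact: "serious",
    tags: ["cat.color", "wcag2aa", "wcag143"],
    nodes: [{ target: [".badge"] }, { target: [".footer-note"] }],
  },
];

const dynamicFindings = flattenViolations("http://localhost:4321/", sampleViolations);
console.log(dynamicFindings.map((f) => f.location));
```

One violation with two nodes becomes two findings, which is why a single root cause can show up as several rows in a report.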

graph LR
    A[Project Source] --> B[Static TS Analyzer]
    A --> C[Dev Server]
    C --> D[Dynamic Playwright + axe]
    B --> E[Findings Merge]
    D --> E
    E --> F[Markdown Reports]

Both paths emit the same WcagFinding shape: ruleId, file:line (or url:selector), severity, WCAG SC reference, suggested fix. The orchestrator merges them, dedupes by (ruleId, file:line, url), scores against a penalty model (-15 critical, -10 serious, -5 moderate, -2 minor, floor at 0), and emits an A through F grade.
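A minimal sketch of that merge, dedupe, and scoring step, under stated assumptions: the field names on `WcagFinding` are my guesses at the shape described above, and the grade cut-offs in `gradeFor` are placeholders, since the post gives the penalty weights but not the grade boundaries.

```typescript
type Severity = "critical" | "serious" | "moderate" | "minor";

// Assumed field names; the post lists the concepts, not the exact keys.
interface WcagFinding {
  ruleId: string;
  location: string; // "file:line" (static) or "url:selector" (dynamic)
  url: string; // empty string for static findings
  severity: Severity;
  wcagSc: string;
  suggestedFix: string;
}

// Penalty weights as stated in the post.
const PENALTY: Record<Severity, number> = { critical: 15, serious: 10, moderate: 5, minor: 2 };

// Drops repeats that share the same (ruleId, location, url) key.
function dedupeFindings(findings: WcagFinding[]): WcagFinding[] {
  const seen: Record<string, true> = {};
  return findings.filter((f) => {
    const key = [f.ruleId, f.location, f.url].join("|");
    if (seen[key]) return false;
    seen[key] = true;
    return true;
  });
}

// Starts at 100, subtracts penalties, floors at 0.
function scoreFindings(findings: WcagFinding[]): number {
  const penalty = findings.reduce((sum, f) => sum + PENALTY[f.severity], 0);
  return Math.max(0, 100 - penalty);
}

// Placeholder thresholds: the real A-F boundaries are not given in the post.
function gradeFor(score: number): string {
  if (score >= 90) return "A";
  if (score >= 80) return "B";
  if (score >= 70) return "C";
  if (score >= 60) return "D";
  return "F";
}

// Ten distinct serious findings (invented sample data).
const tenSerious: WcagFinding[] = [];
for (let i = 0; i < 10; i++) {
  tenSerious.push({
    ruleId: "color-contrast",
    location: `/page-${i}:.badge`,
    url: `/page-${i}`,
    severity: "serious",
    wcagSc: "1.4.3",
    suggestedFix: "Darken the foreground color",
  });
}

// Concatenating the list with itself simulates both paths reporting the same issue.
const merged = dedupeFindings(tenSerious.concat(tenSerious));
const finalScore = scoreFindings(merged);
console.log(merged.length, finalScore, gradeFor(finalScore)); // 10 0 F
```

Ten serious findings at 10 points each use up the full 100 points, so the score lands exactly on the floor.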

That’s the deterministic floor. It’s not nothing. It’s just not enough.

Numbers from the V0.2 baseline

Here’s what V0.2 saw on portfolio.sdet.it.

Source	Findings	Real	Notes
Static TS	2	0	Both `playwright-report/index.html` (test artifact)
Dynamic axe	4	3	`.hero-load` + `.license` × 3 contrast
Dynamic focus-visibility	4	0	All `<astro-dev-toolbar>` (Astro dev injection)
Total	10	3

Severity breakdown: 0 critical, 10 serious, 0 moderate, 0 minor. Score: 0/100 (penalty 100, floored). Grade: F. Wall time: 2.14 seconds.

That’s a failing grade with a clean ten-finding report. Looks dramatic. Reality is more boring: 7 of 10 findings were noise. Test artifacts and dev-mode injections that never ship to production.

playwright-report/index.html exists because the Playwright HTML reporter writes it there after every test run. That HTML is autogenerated and never deployed, but the static analyzer scans it because it lives in the repo. The fix is a .wcagignore glob, which is on the roadmap.

The <astro-dev-toolbar> findings are the Astro dev-mode toolbar injection. It only exists when you run astro dev. It never ships to production. Re-running against a pnpm preview build eliminates them. Documented noise.
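Until an ignore file exists, the same triage can be approximated as a post-filter on the report. The sketch below is hypothetical: `ReportedFinding`, `isNoise`, the two ignore lists, and the sample rule ids are invented for illustration and do not reflect how the toolkit handles this.

```typescript
interface ReportedFinding {
  ruleId: string;
  file?: string;
  selector?: string;
}

// Hypothetical ignore rules covering the two noise sources described above.
const IGNORED_PATH_PREFIXES = ["playwright-report/"];
const IGNORED_SELECTOR_FRAGMENTS = ["astro-dev-toolbar"];

// A finding is noise if its file sits under an ignored path
// or its selector mentions an ignored element.
function isNoise(finding: ReportedFinding): boolean {
  const file = finding.file || "";
  const selector = finding.selector || "";
  const ignoredPath = IGNORED_PATH_PREFIXES.some((prefix) => file.indexOf(prefix) === 0);
  const ignoredSelector = IGNORED_SELECTOR_FRAGMENTS.some(
    (fragment) => selector.indexOf(fragment) !== -1
  );
  return ignoredPath || ignoredSelector;
}

// Invented sample: one finding per category (rule ids are placeholders).
const reported: ReportedFinding[] = [
  { ruleId: "static-rule", file: "playwright-report/index.html" },
  { ruleId: "focus-visibility", selector: "astro-dev-toolbar" },
  { ruleId: "color-contrast", selector: ".license" },
];

const realFindings = reported.filter((f) => !isNoise(f));
console.log(realFindings.map((f) => f.ruleId)); // [ 'color-contrast' ]
```

A prefix match is deliberately cruder than a glob. It is enough to separate test artifacts and dev-mode injections from findings that ship.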

Honest number after stripping noise: 3 real bugs. All contrast. All in the light theme. All caught only because the dynamic path renders the page in a browser. Static caught zero of them, because all three hide behind CSS variable indirection and design tokens.

Three findings on a 23-page site. Clean. But that’s because static + dynamic only catch what’s there in HTML and CSS. The bugs in source are still in source.

The 3 real findings

Three real bugs. Same root-cause shape: design token misuse hiding behind CSS variables. Static didn’t see them. Dynamic did.

Bug 1 lives at src/components/sections/Hero.astro line 17. The hero load line (“KERNAL READY”) renders #38935a on white at 3.82:1. WCAG 1.4.3 wants at least 4.5:1 for normal text. Source uses var(--color-accent) plus opacity: 0.85, which static analysis can’t multiply. Dynamic axe ran the page, got the computed color, computed the ratio, flagged it.

Bugs 2-4 are the same root cause across three project cards. File: src/components/cards/ProjectCard.astro lines 199-202. The .license badges (AGPL-3.0, MIT featured, MIT third card) render #22c55e on #fafafa at 2.18:1. That’s well below the 3:1 floor for non-text contrast, let alone 4.5:1 for text.
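For anyone who wants to check those ratios by hand, here is a small sketch of the WCAG 2.x relative luminance and contrast ratio formulas applied to the two color pairs quoted above. The function names are mine, and truncating to two decimals (rather than rounding) is an assumption about how the reported figures were produced. The inputs are the already-computed hex values, so the `var()` and `opacity` resolution the browser performs is out of scope here.

```typescript
// Linearize one 8-bit sRGB channel, per the WCAG 2.x relative luminance definition.
function linearize(channel8bit: number): number {
  const c = channel8bit / 255;
  return c <= 0.03928 ? c / 12.92 : Math.pow((c + 0.055) / 1.055, 2.4);
}

// Relative luminance of a 6-digit hex color such as "#22c55e".
function relativeLuminance(hex: string): number {
  const digits = hex.replace(/^#/, "");
  if (!/^[0-9a-fA-F]{6}$/.test(digits)) {
    throw new Error(`Expected a 6-digit hex color, got "${hex}"`);
  }
  const r = linearize(parseInt(digits.slice(0, 2), 16));
  const g = linearize(parseInt(digits.slice(2, 4), 16));
  const b = linearize(parseInt(digits.slice(4, 6), 16));
  return 0.2126 * r + 0.7152 * g + 0.0722 * b;
}

// Contrast ratio = (lighter + 0.05) / (darker + 0.05), always >= 1.
function contrastRatio(foreground: string, background: string): number {
  const a = relativeLuminance(foreground);
  const b = relativeLuminance(background);
  return (Math.max(a, b) + 0.05) / (Math.min(a, b) + 0.05);
}

// Truncate (not round) to two decimals so a near-miss never rounds up into a pass.
function truncate2(value: number): number {
  return Math.floor(value * 100) / 100;
}

const heroRatio = truncate2(contrastRatio("#38935a", "#ffffff"));
const licenseRatio = truncate2(contrastRatio("#22c55e", "#fafafa"));

console.log(heroRatio, licenseRatio); // 3.82 2.18
console.log(heroRatio >= 4.5, licenseRatio >= 4.5); // false false
```

Both pairs fail the 4.5:1 threshold for normal text, and the badge color also falls short of 3:1.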

One design token (--color-accent-muted, resolving to a too-light green in light theme) feeds three rendered components. Static analysis sees three separate .license selectors with var(--color-accent-muted) and can’t follow the indirection to know what RGB value comes out the other side. Dynamic axe walks each project page, computes the actual rendered color, reports three findings. Three visible symptoms, one root cause, but V0.2 reports them as three separate findings because that’s all it can see.

These are the wins of dynamic testing. The losses come tomorrow.

What V0.2 is missing

Here’s the kicker. There are nine more production WCAG bugs in this codebase. V0.2 won’t find them. Not in this run, not in ten more runs.

Sneak preview without spoilers (those are for Part 2):

aria-label misuse on semantic elements: aria-label="C64 boot" on a <p>, aria-label="Tech stack" on a <ul>, aria-label="Key metrics" on a <dl>. Three instances, three different elements, one anti-pattern.
Heading hierarchy gaps in MDX content.
A token-level contrast issue affecting six components on the dark theme through a single --color-text-subtle value.
Filter pills missing toggle state (aria-pressed).
A few smaller things.

Why static can’t see them: aria-label values are strings. A regex matches the attribute presence, not whether the attribute is appropriate on the element. Token misuse propagates through var() indirection across files. Toggle state requires understanding interaction context, not text.

Why dynamic can’t see them either: most aria misuse is in JSX or Astro templates, and the rendered HTML still has the attribute (axe checks presence, not appropriateness). The visited-pages-only sweep misses internal article routes. ARIA attribute presence does not equal semantic correctness, and that’s a semantic call, not a pattern.

These are the wins of AI specialists reading source. They open a file, recognize that a <p> already has an implicit role and visible text, and flag the aria-label as an override anti-pattern. They follow var() chains across files. They reason about toggle state.

Tomorrow: Part 2.

Honest cliffhanger

If V0.2 were the whole tool, I’d ship it as “use this in CI, manually audit the rest.” That’s where most deterministic toolkits stop, and it’s a reasonable place to stop. Three production bugs caught on a live portfolio is real value. CI failing on contrast 3.82:1 versus the required 4.5:1 is exactly the kind of guardrail a deterministic tier exists to provide.

But it’s not enough. Nine more bugs sit in source right now, and no amount of re-running V0.2 will surface them. Different scan surface, different layer. Static asks “what is rendered.” Dynamic asks “what is computed.” Neither asks “what is intended.”

Tomorrow Part 2: same portfolio, V0.3 public adds 5 AI specialists reading source through Read/Grep/Glob. Spoiler: 16 unique findings caught across two audit runs. Triangulation, not regression. #FromTheField.

Repo and clone

Repo: github.com/darco81/sdet-wcag-toolkit (AGPL-3.0).

Quick start:

pnpm install
pnpm -r build
wcag-toolkit audit . --url http://localhost:4321

That gets you the V0.2 baseline (static + dynamic). For the full V0.3 AI tier, open Claude Code in your project root and run /wcag:audit. The skill dispatches 5 specialists in parallel through the Task tool, merges findings with the deterministic backbone, emits an A-F grade.

Pro tier at sdet.it/services: multi-runtime (Claude Code, OpenCode, Ollama local for sensitive client repos), auto-fix engine, niche specialists.

Series continues tomorrow. Part 2: triangulation.

