Yesterday’s V0.2 audit on this same portfolio found 3 real findings. Static plus dynamic, no LLM in the loop, deterministic floor doing what it does. Today I’m running V0.3 public on the same target, same routes, same Astro 5 build. V0.3 adds 5 AI specialists reading source through Read/Grep/Glob.

Two productive audit runs, eight hours apart. 16 unique production findings caught between them. A third run, after fixing the first two batches, finds zero new findings. That’s the number worth opening on: 16 caught, 0 remaining.

Two productive rounds, one convergence round. That last round is the part most AI-audit case studies leave out, because it requires running the same tool against the same project a third time and reporting whether anything new shows up. Nothing did. Three independent runs, three different LLM scan surfaces, all converged on no production issues.

This is the most honest accessibility audit data I’ve shipped. It’s also the part of the toolkit story I had no way to tell yesterday. Today’s tier reads the JSX, not the rendered HTML. Different layer, different jobs, different findings.

V0.3 architecture

V0.3 adds five specialists. Each reads source. Each has focused scope. They run in parallel.

The five are: semantic-structure (heading hierarchy, landmark coverage, lang attributes, modal heading rank), aria-patterns (ARIA misuse, live region politeness, dialog type taxonomy), keyboard-interaction (composite widgets, focus management, onClick without onKeyDown, APG keyboard tables), color-contrast-static (CSS contrast computed from source, color-only indicators, prefers-* media queries), and forms-accessibility (labels, validation timing, autocomplete, payment review steps).

Each specialist receives a focused prompt and has access to Read, Grep, Glob, and LS over the project source. No write access, no shell access, no internet. They open files, look at code, return JSON findings with ruleId, file:line, severity, WCAG SC, and a suggested fix. They don’t run the project, don’t render anything, don’t compute pixel values. They read.

A Lead orchestrator dispatches all five through Claude Code’s Task tool in a single parallel batch. Five specialists run concurrently. The orchestrator waits, collects, merges, and dedupes by (ruleId, file:line, url). Findings that match the static or dynamic backbone collapse to one entry.
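Here is a minimal sketch of that merge step, not the toolkit's actual code. Only the dedupe key (ruleId, file:line, url) comes from the description above; the `Finding` shape, the `sources` field, the `mergeFindings` name, and the sample rule IDs and `Card.astro:10` location are assumptions for illustration.

```typescript
// Hypothetical finding shape; field names are assumptions.
type FindingSource = "static" | "dynamic" | "ai";

interface Finding {
  ruleId: string;
  location: string; // "file:line", e.g. "Hero.astro:17"
  url: string;
  wcag: string; // success criterion, e.g. "4.1.2"
  sources: FindingSource[];
}

// Dedupe key built from (ruleId, file:line, url).
function dedupeKey(f: Finding): string {
  return [f.ruleId, f.location, f.url].join("|");
}

// Findings with the same key collapse into one entry that remembers
// every source that reported it.
function mergeFindings(batches: Finding[][]): Finding[] {
  const merged = new Map<string, Finding>();
  for (const batch of batches) {
    for (const finding of batch) {
      const key = dedupeKey(finding);
      const existing = merged.get(key);
      if (existing === undefined) {
        merged.set(key, { ...finding, sources: finding.sources.slice() });
        continue;
      }
      for (const source of finding.sources) {
        if (existing.sources.indexOf(source) === -1) existing.sources.push(source);
      }
    }
  }
  return Array.from(merged.values());
}

// Illustrative data only: one finding reported by both dynamic and AI tiers.
const dynamicBatch: Finding[] = [
  { ruleId: "color-contrast", location: "Card.astro:10", url: "/", wcag: "1.4.3", sources: ["dynamic"] },
];
const aiBatch: Finding[] = [
  { ruleId: "color-contrast", location: "Card.astro:10", url: "/", wcag: "1.4.3", sources: ["ai"] },
  { ruleId: "aria-label-misuse", location: "Hero.astro:17", url: "/", wcag: "4.1.2", sources: ["ai"] },
];

const mergedFindings = mergeFindings([dynamicBatch, aiBatch]);
console.log(mergedFindings.map((f) => `${f.ruleId} @ ${f.location} [${f.sources.join("+")}]`));
```

Keeping the list of sources on the merged entry is one way to see afterwards which tier saw what, without counting the same issue twice in the score.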

```mermaid
graph TB
    A[Project Source] --> B[Lead Orchestrator]
    B --> C[5 AI Specialists in parallel]
    C --> D1[semantic-structure]
    C --> D2[aria-patterns]
    C --> D3[keyboard-interaction]
    C --> D4[color-contrast-static]
    C --> D5[forms-accessibility]
    D1 --> E[Read/Grep/Glob source]
    D2 --> E
    D3 --> E
    D4 --> E
    D5 --> E
    A --> F[Static TS Analyzer]
    A --> G[Dynamic Playwright + axe]
    E --> H[Findings Merge + Dedupe]
    F --> H
    G --> H
    H --> I[Score + Grade A-F]
    I --> J[Reports: dev + exec]
```

Same merge step as V0.2. Same WcagFinding shape. Same dedupe logic. The static and dynamic backbones still run, still emit findings, still feed the same penalty model and A through F grade. AI is the third source, not the replacement.

Run 1 (morning), what AI caught that V0.2 missed

Morning audit on the same portfolio. V0.2 found 3 contrast findings. V0.3 found 9, including those 3.

The 9 findings split into three clusters. The first was ARIA misuse: three findings under WCAG 4.1.2 Name, Role, Value. Hero.astro:17 had aria-label="C64 boot" on a <p> element, overriding the visible text. ProjectCard.astro:45 had aria-label="Tech stack" on a <ul>. EcosystemCard.astro:40 had aria-label="Key metrics" on a <dl>.

Static missed all three because aria-label values are strings. The regex matches the attribute, not whether the attribute is appropriate on a paragraph that already has an implicit role and visible text. Dynamic missed them too. Axe checks attribute presence, not semantic appropriateness, and the rendered HTML still has the aria-label intact. The aria-patterns specialist read the JSX, recognized that paragraphs and definition lists already carry semantics, and flagged the overrides as an anti-pattern.

The second cluster was contrast across a broader surface. Five findings, all WCAG 1.4.3. Three came back from V0.2 dynamic (the .hero-load line and the three .license badges from yesterday). The AI color-contrast specialist also caught all three by reading the CSS source directly. The new one was ArticleCard.astro:136 .meta at 2.98:1 on a dark gradient background. Dynamic axe didn’t catch it, because axe walks visited pages and computes contrast against the actual rendered state. The article cards in question never landed in axe’s sweep with that exact gradient combination. The AI specialist read the CSS, noticed --color-text-subtle against the dark background token, and computed the ratio manually.
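The ratio itself comes from the WCAG 2.x relative-luminance formula. Below is a minimal sketch for opaque `#rrggbb` colors; the function names and the sample colors are mine, not the specialist's, and this is not a claim about how the specialist computed its numbers.

```typescript
// Convert one 8-bit sRGB channel to linear light.
// (The WCAG text historically uses 0.03928 as the threshold; for 8-bit
// values the result is identical.)
function channelToLinear(value8bit: number): number {
  const c = value8bit / 255;
  return c <= 0.04045 ? c / 12.92 : Math.pow((c + 0.055) / 1.055, 2.4);
}

// Relative luminance of an opaque #rrggbb color.
function relativeLuminance(hex: string): number {
  if (!/^#[0-9a-fA-F]{6}$/.test(hex)) {
    throw new Error("expected a #rrggbb color, got " + hex);
  }
  const r = parseInt(hex.slice(1, 3), 16);
  const g = parseInt(hex.slice(3, 5), 16);
  const b = parseInt(hex.slice(5, 7), 16);
  return 0.2126 * channelToLinear(r) + 0.7152 * channelToLinear(g) + 0.0722 * channelToLinear(b);
}

// Contrast ratio: (lighter + 0.05) / (darker + 0.05), from 1 to 21.
function contrastRatio(foreground: string, background: string): number {
  const a = relativeLuminance(foreground);
  const b = relativeLuminance(background);
  return (Math.max(a, b) + 0.05) / (Math.min(a, b) + 0.05);
}

// WCAG 1.4.3 (AA): 4.5:1 for normal text, 3:1 for large text.
function passesAA(ratio: number, largeText = false): boolean {
  return ratio >= (largeText ? 3 : 4.5);
}

// Sample colors chosen for illustration, not taken from the portfolio.
const sampleRatio = contrastRatio("#767676", "#ffffff");
console.log(sampleRatio.toFixed(2), passesAA(sampleRatio));
```

A gradient has no single background color, so a real check on something like the .meta case would presumably have to evaluate the text color against more than one stop. This sketch only handles flat, opaque pairs.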

The third cluster was a single minor finding: contact.astro:52 had <ul role="list">. Redundant ARIA. WCAG 4.1.2. The semantic-structure specialist flagged it with a one-line fix: drop the role attribute, the implicit role is already there.

Six issues V0.2 couldn’t see, plus the three it could. Different layer, different findings. The aria misuse is invisible to anything that doesn’t read source. The new contrast finding is invisible to anything that doesn’t follow CSS variable indirection. That’s the discovery payoff for source-reading agents.
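Following CSS variable indirection can be sketched as a small recursive lookup. This is an assumption-laden illustration, not the specialist's method: the `Tokens` map, the `resolveCssValue` name, and the intermediate `--gray-500` token are hypothetical, and the two-level chain is invented to show the indirection.

```typescript
// Token name -> declared value, as collected from the stylesheet source.
type Tokens = Record<string, string>;

// Resolve a value through var(--token[, fallback]) until a literal remains.
function resolveCssValue(value: string, tokens: Tokens, depth = 0): string {
  if (depth > 16) throw new Error("var() chain too deep, possibly circular");
  const trimmed = value.trim();
  const match = /^var\(\s*(--[A-Za-z0-9_-]+)\s*(?:,\s*(.+))?\)$/.exec(trimmed);
  if (match === null) return trimmed; // already a literal
  const name = match[1] ?? "";
  const fallback: string | undefined = match[2];
  const definition: string | undefined = tokens[name];
  if (definition !== undefined) return resolveCssValue(definition, tokens, depth + 1);
  if (fallback !== undefined) return resolveCssValue(fallback, tokens, depth + 1);
  throw new Error("unresolved custom property " + name);
}

// Hypothetical token chain for illustration.
const tokens: Tokens = {
  "--gray-500": "#737373",
  "--color-text-subtle": "var(--gray-500)",
};

console.log(resolveCssValue("var(--color-text-subtle)", tokens));
```

Once the literal color is known, it can be checked against the resolved background. That is the step a rendered-page sweep never performs for a page it did not visit.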

Round 1 fix

I fixed all 9. Six commits, twenty-five minutes. Quick details:

| Fix | File | WCAG SC | Commit |
| --- | --- | --- | --- |
| Remove aria-label from .hero-load <p> | Hero.astro:17 | 4.1.2 | 0c0e19e |
| aria-label → sr-only <h4> + aria-labelledby (tech stack) | ProjectCard.astro:45 | 4.1.2 | 9bca4bf |
| Same pattern (key metrics) | EcosystemCard.astro:40 | 4.1.2 | b141c52 |
| Remove redundant role="list" | contact.astro:52 | 4.1.2 | 356ae8c |
| .meta color: text-subtle → text-muted | ArticleCard.astro:136 | 1.4.3 | 3498856 |
| .license color: accent-muted → accent | ProjectCard.astro | 1.4.3 | 4929186 |

The aria-label fixes weren’t all the same. On the <p>, removing the attribute was the right call: the visible text already says it. On the <ul> and <dl>, the labels were structural (“tech stack”, “key metrics”), so I added a screen-reader-only <h4> plus aria-labelledby pointing at it. Different replacement strategy per element, same anti-pattern caught.

The .license fix swapped a single design token (--color-accent-muted to --color-accent), which resolved all three license badge findings in one commit. First taste of token-level multiplicative impact.

Re-run the audit, expecting a clean grade. Got 7 NEW findings instead. That’s where it gets interesting.

Run 2 (afternoon), the triangulation insight

Same portfolio, post-Round-1, second audit. Seven new findings. None of them regressions. All of them in places Run 1 didn’t touch.

The list, in order of severity: Footer.astro:269 .footer-links .note at 1.77:1 (the worst ratio I’ve shipped, dark theme). Footer.astro:259 .footer-links a at 2.80:1. ArticleCard.astro:123 .description at 2.80:1. Topbar.astro:194 .icon-button at 2.94:1. Topbar.astro:167 .nav-desktop a at 2.94:1. MatrixToggle.astro:69 .matrix-hint at 3.15:1. And one outside the contrast cluster: articles/index.astro:47 filter pills missing aria-pressed (WCAG 3.3.2 plus 4.1.2).

The pattern: six of the seven were contrast, all using --color-text-subtle or --color-text-muted from the dark theme stylesheet. Single root cause, six visible symptoms, six different files. The seventh was a forms issue, toggle button missing programmatic state.

Why didn’t Run 1 surface these? The honest answer is that LLM specialists scan a slightly different surface on each run. Different prompt activation, different file traversal order, different attention. Run 1 hit the aria-misuse cluster because the morning prompts steered the specialists toward semantic elements with explicit ARIA. Run 2 hit the layout shells (Footer, Topbar) and content surfaces (ArticleCard description, MatrixToggle hint) because the prompts in that run, working on a freshly fixed codebase, opened a different set of files.

This is not a bug. This is the design. AI specialists trade determinism for breadth. One run sees one slice. Three runs see the project.

The counter-narrative I want to push back against is the loud one: “AI auditors are unreliable because they’re nondeterministic.” That framing measures the wrong thing. The right measure isn’t “did you produce identical output twice?” It’s “did multiple runs converge on the same final state?” Convergence on no findings across multiple independent runs is the strongest quality signal a probabilistic auditor can give you. Static rules can’t even produce that signal, because they only check what they were programmed to check.

Two complementary modes, not competitors. Static catches what’s there in a deterministic, CI-friendly way. AI catches what’s intended through source semantic understanding, with breadth that requires multiple runs to fully harvest. The boring tier and the AI tier do different jobs.

I had a choice at this point: ship knowing only Run 1, or fix what Run 2 found. I went deeper.

Round 2 fix and the multiplicative token impact

Two commits. Twenty-five minutes. Seven findings resolved. One commit fixed six findings with four lines of CSS.

```diff
:root {
  /* dark theme default, text on #141414 */
- --color-text-subtle: #737373;   /* 3.4:1, AA fail */
- --color-text-muted:  #a1a1a1;   /* 6.5:1, AA pass but tight */
+ --color-text-subtle: #a3a3a3;   /* 6.7:1, AA pass */
+ --color-text-muted:  #b3b3b3;   /* 8.8:1, AA pass with headroom */
}
```

That single commit (03ae229) touched global.css and exactly nothing else. The six files containing the symptoms (Footer×2, Topbar×2, ArticleCard, MatrixToggle) were not opened. They didn’t need to be. Every component using --color-text-muted or --color-text-subtle against the dark background got the new ratio for free.

The forms fix was a separate commit (ab4d716): added aria-pressed plus a small JS toggle sync to the filter pills in articles/index.astro. Filter pills now announce “pressed” or “not pressed” to screen readers, which is what WCAG 4.1.2 (Name, Role, Value), WCAG 3.3.2 (Labels or Instructions), and the toggle button pattern from APG all expect.
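The toggle sync amounts to flipping aria-pressed on activation. The sketch below is not the code from ab4d716: `PressableElement`, `togglePressed`, and the fake element are hypothetical, and the structural interface exists only so the example runs without a DOM.

```typescript
// Minimal structural type; a real HTMLButtonElement satisfies it.
interface PressableElement {
  getAttribute(name: string): string | null;
  setAttribute(name: string, value: string): void;
}

// Flip aria-pressed and return the new state.
// A missing attribute is treated as "not pressed".
function togglePressed(el: PressableElement): boolean {
  const next = el.getAttribute("aria-pressed") !== "true";
  el.setAttribute("aria-pressed", next ? "true" : "false");
  return next;
}

// Stand-in element for running the sketch outside a browser.
function createFakeElement(): PressableElement {
  const attributes = new Map<string, string>();
  return {
    getAttribute: (name) => attributes.get(name) ?? null,
    setAttribute: (name, value) => {
      attributes.set(name, value);
    },
  };
}

const pill = createFakeElement();
pill.setAttribute("aria-pressed", "false");
togglePressed(pill);
console.log(pill.getAttribute("aria-pressed"));

// In a browser, the wiring might look like this (selector is hypothetical):
// document.querySelectorAll<HTMLButtonElement>("button[data-filter]")
//   .forEach((b) => b.addEventListener("click", () => togglePressed(b)));
```

The attribute only means "toggle button" to assistive technology when it sits on an element with the button role, so the pills are assumed here to be real `<button>` elements.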

Static analyzers report 6 contrast violations. The AI specialist reads source, recognizes the pattern: 6 visible symptoms, 2 design tokens, 1 root cause. One commit, four lines, six findings resolved. That’s design-tokens-first architecture meeting AI semantic understanding. More on that pattern tomorrow.
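The "six symptoms, two tokens" view is just a group-by on the token each failing rule resolves to. In the sketch below, the locations, selectors, and ratios are the six Run 2 contrast findings, but the token assigned to each one is illustrative: the write-up only says each used one of the two tokens, not which. `ContrastFinding` and `groupByToken` are hypothetical names.

```typescript
interface ContrastFinding {
  location: string;
  selector: string;
  ratio: number;
  token: string; // ILLUSTRATIVE assignment, see note above
}

const round2Contrast: ContrastFinding[] = [
  { location: "Footer.astro:269", selector: ".footer-links .note", ratio: 1.77, token: "--color-text-subtle" },
  { location: "Footer.astro:259", selector: ".footer-links a", ratio: 2.8, token: "--color-text-muted" },
  { location: "ArticleCard.astro:123", selector: ".description", ratio: 2.8, token: "--color-text-muted" },
  { location: "Topbar.astro:194", selector: ".icon-button", ratio: 2.94, token: "--color-text-muted" },
  { location: "Topbar.astro:167", selector: ".nav-desktop a", ratio: 2.94, token: "--color-text-muted" },
  { location: "MatrixToggle.astro:69", selector: ".matrix-hint", ratio: 3.15, token: "--color-text-subtle" },
];

// Group findings by token, most-affected token first.
function groupByToken(
  findings: ContrastFinding[],
): { token: string; findings: ContrastFinding[] }[] {
  const groups = new Map<string, ContrastFinding[]>();
  for (const f of findings) {
    const bucket = groups.get(f.token);
    if (bucket === undefined) groups.set(f.token, [f]);
    else bucket.push(f);
  }
  return Array.from(groups.entries())
    .map(([token, grouped]) => ({ token, findings: grouped }))
    .sort((a, b) => b.findings.length - a.findings.length);
}

for (const group of groupByToken(round2Contrast)) {
  console.log(group.token, group.findings.map((f) => f.location).join(", "));
}
```

Sorting by group size puts the token with the widest blast radius first, which is where a single edit pays off most.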

Re-audit time. Round 3.

Run 3, convergence

Round 3, same skill, same portfolio, post-Round-1-and-2. All 5 AI specialists return empty arrays.

The numbers: AI specialists 0 findings each (5 of 5 succeeded). Static analyzer 2 findings, both playwright-report/index.html (the Playwright HTML reporter output, generated, never deployed). Dynamic 4 raw findings deduped to 1 unique, all <astro-dev-toolbar> (the Astro dev-mode injection that doesn’t ship to production). Total real findings, after stripping the documented noise floor: 0. Score: 100/100. Grade: A.
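Stripping the noise floor can be expressed as a short list of documented exclusions, each with a reason attached. This is a sketch under assumptions: `RawFinding`, `NoiseRule`, `stripNoise`, and the sample rule IDs are hypothetical; the two exclusions mirror the ones named above.

```typescript
interface RawFinding {
  ruleId: string;
  file: string | null;
  selector: string | null;
}

interface NoiseRule {
  reason: string;
  matches: (f: RawFinding) => boolean;
}

// Documented noise floor: things the audit sees that never ship.
const noiseFloor: NoiseRule[] = [
  {
    reason: "generated Playwright HTML report, never deployed",
    matches: (f) => f.file !== null && f.file.indexOf("playwright-report/") === 0,
  },
  {
    reason: "Astro dev toolbar, injected in dev mode only",
    matches: (f) => f.selector !== null && f.selector.indexOf("astro-dev-toolbar") !== -1,
  },
];

// Split findings into real ones and excluded ones, keeping the reason.
function stripNoise(findings: RawFinding[]): {
  real: RawFinding[];
  noise: { finding: RawFinding; reason: string }[];
} {
  const real: RawFinding[] = [];
  const noise: { finding: RawFinding; reason: string }[] = [];
  for (const finding of findings) {
    let reason: string | null = null;
    for (const rule of noiseFloor) {
      if (rule.matches(finding)) {
        reason = rule.reason;
        break;
      }
    }
    if (reason === null) real.push(finding);
    else noise.push({ finding, reason });
  }
  return { real, noise };
}

// Shape of the Run 3 result: 2 static + 1 deduped dynamic, all noise.
// Rule IDs are placeholders.
const run3Raw: RawFinding[] = [
  { ruleId: "static-rule-a", file: "playwright-report/index.html", selector: null },
  { ruleId: "static-rule-b", file: "playwright-report/index.html", selector: null },
  { ruleId: "dynamic-rule-c", file: null, selector: "astro-dev-toolbar" },
];

const run3Split = stripNoise(run3Raw);
console.log(run3Split.real.length, run3Split.noise.length);
```

Recording a reason per exclusion keeps the "0 real findings" claim auditable: anyone can see what was dropped and why.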

Three independent runs across 8 hours. Three different LLM scan surfaces, no shared context between runs. All converging on no production issues for the same codebase. That’s the convergence signal. It’s not a single point measurement of whether an LLM is right. It’s a multi-run aggregate that asks whether the project is right.

Most AI-audit case studies stop at run 1. “Tool found N issues.” One audit, no verification, no convergence test, no honest framing of what the tool sees vs. what it doesn’t. The data here goes further: 9 found in run 1, 7 found in run 2 (different surface, not regression), 0 found in run 3 across independent specialists. 16 unique findings caught, 0 remaining. Multiple runs are the unit, not single runs.

Convergence is the design point. Multiple runs aggregate to broader coverage than any single run can give. Round 3 finding zero new is what “done” looks like for an AI-driven audit. Not “the LLM said no problems,” but “three independent passes, three different scan surfaces, all said no problems.”
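The bookkeeping behind "9, then 7, then 0" is small enough to sketch. The names here (`trackConvergence`, `quietRunsRequired`, the placeholder keys) are hypothetical, and the stopping rule of one quiet run is an assumption that can be tightened.

```typescript
interface ConvergenceReport {
  newPerRun: number[];
  uniqueTotal: number;
  converged: boolean;
}

// Each run is a list of finding keys. A run "adds" a finding only if no
// earlier run reported the same key. The audit counts as converged when
// the last `quietRunsRequired` runs added nothing new.
function trackConvergence(runs: string[][], quietRunsRequired = 1): ConvergenceReport {
  if (quietRunsRequired < 1) throw new Error("quietRunsRequired must be at least 1");
  const seen = new Set<string>();
  const newPerRun: number[] = [];
  for (const run of runs) {
    let added = 0;
    for (const key of run) {
      if (!seen.has(key)) {
        seen.add(key);
        added += 1;
      }
    }
    newPerRun.push(added);
  }
  const tail = newPerRun.slice(-quietRunsRequired);
  const converged = tail.length === quietRunsRequired && tail.every((n) => n === 0);
  return { newPerRun, uniqueTotal: seen.size, converged };
}

// Placeholder keys standing in for the real (ruleId, file:line, url) keys.
function placeholderKeys(prefix: string, count: number): string[] {
  const keys: string[] = [];
  for (let i = 1; i <= count; i++) keys.push(prefix + "-" + i);
  return keys;
}

const convergence = trackConvergence([
  placeholderKeys("run1", 9),
  placeholderKeys("run2", 7),
  [],
]);
console.log(convergence);
```

Requiring two quiet runs instead of one (`quietRunsRequired = 2`) would be a stricter bar; with the three runs above it would report not yet converged.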

But convergence on the homepage isn’t convergence on the site

Three runs on the homepage said convergence. Same homepage, every audit pass, eight hours apart, three different LLM scan surfaces, zero new findings. Clean signal, for one URL.

Then V0.4 landed. Multi-page audit with 4-strategy auto-discovery: sitemap, router-scan, AI agent reading the project structure, JSON config when none of those work. Same toolkit, broader surface. I ran V0.4 router-scan on the same portfolio.
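One way to picture auto-discovery is a fallback chain that tries each strategy in the listed order and takes the first one that yields routes. That chaining behavior is my assumption, as are the `DiscoveryStrategy` and `discoverRoutes` names; the strategies are stubbed and synchronous to keep the sketch runnable, where real ones would read files or call an agent.

```typescript
interface DiscoveryStrategy {
  name: string;
  // Returns the discovered routes, or null when the strategy does not apply.
  discover: () => string[] | null;
}

// Try strategies in order; the first non-empty result wins.
function discoverRoutes(
  candidates: DiscoveryStrategy[],
): { strategy: string; routes: string[] } {
  for (const candidate of candidates) {
    const routes = candidate.discover();
    if (routes !== null && routes.length > 0) {
      return { strategy: candidate.name, routes };
    }
  }
  throw new Error("no discovery strategy produced any routes");
}

// Stubs for illustration: pretend there is no sitemap, so router-scan wins.
const strategies: DiscoveryStrategy[] = [
  { name: "sitemap", discover: () => null },
  { name: "router-scan", discover: () => ["/", "/privacy", "/numbers", "/projects"] },
  { name: "ai-agent", discover: () => null },
  { name: "json-config", discover: () => null },
];

const discovered = discoverRoutes(strategies);
console.log(discovered.strategy, discovered.routes);
```

Reporting which strategy produced the route list matters, because the audited surface depends on it: a router scan and a published sitemap can disagree about what the site contains.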

Nine new findings on three pages the homepage audit never touched: /privacy, /numbers, /projects. Structural HTML issues: <ul> markup where it should have been <dl>, .archived badge color contrast at 3.0:1, missing definition list semantics on stat blocks. None of them visible from the homepage’s perspective, because none of those pages were in the homepage audit’s surface to begin with.

Round 4 fix sprint: 35 minutes, three atomic commits. Re-audit on router-scan: clean. Then I switched to sitemap strategy and ran the audit on the full published site. 35 routes.

5,816 findings.

Long pause.

99.86% of those 5,816 were ONE bug. Shiki light-theme tokens leaking on light backgrounds across every code block on every article page. Single config change in the Astro Markdown integration, 5,808 findings cleared. The remaining 8 split into 1 residual .archived badge token (one-line fix) and 7 keyboard-trap-runtime false positives on long article pages, filed as a toolkit issue for v0.5+ (natural focus-cycle completion misclassified as a trap).

Pattern: single-page audit measures one URL. It’s a Lighthouse extension at best. Multi-page audit with 4-strategy auto-discovery measures the actual production surface. That’s what the word “professional” earns.

Convergence and coverage are independent quality dimensions. Three runs converged on the homepage. The site needed a different tool to find the rest.

Quick honest moment, the dogfooding bug

I shipped V0.3 on Friday. Sunday I dogfooded it on this same portfolio. The audit returned 10 findings, but with a footnote: “All 5 AI specialists returned errors.”

The bug was in the /wcag:audit skill, not the toolkit code. The SKILL.md instructed Claude Code to invoke the CLI via Bash subprocess with --use-ai. That subprocess runs Node directly, where there is no globalThis.Task, because the Task tool only exists inside Claude Code’s JS runtime. The 5 specialists failed silently at dispatch and the orchestrator fell through to static plus dynamic only. The CLI’s design was correct (graceful degradation when Task is unavailable). The skill’s design was wrong (routing AI through a layer where Task can’t reach).
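Graceful degradation of this kind can be sketched as a capability check that records why the AI tier was skipped. Everything named here (`TaskFn`, `findTaskTool`, `planAudit`, the warning text) is hypothetical; the only detail taken from the story is that a `Task` global is absent in a plain Node subprocess. How the real CLI detects it is not shown in this post.

```typescript
type TaskFn = (prompt: string) => Promise<unknown>;

// Look for a callable `Task` on the given scope object.
// In a real runtime the scope would be globalThis.
function findTaskTool(scope: Record<string, unknown>): TaskFn | null {
  const candidate = scope["Task"];
  return typeof candidate === "function" ? (candidate as TaskFn) : null;
}

interface AuditPlan {
  tiers: string[];
  warning: string | null;
}

// Decide which tiers run. If AI was requested but Task is missing,
// fall back to static plus dynamic and say so explicitly.
function planAudit(useAi: boolean, scope: Record<string, unknown>): AuditPlan {
  const tiers = ["static", "dynamic"];
  if (!useAi) return { tiers, warning: null };
  if (findTaskTool(scope) === null) {
    return {
      tiers,
      warning: "AI specialists skipped: Task tool unavailable in this runtime",
    };
  }
  return { tiers: tiers.concat(["ai"]), warning: null };
}

// Simulated scopes: one with a Task function, one without.
const insideSession = planAudit(true, { Task: (prompt: string) => Promise.resolve(prompt) });
const bashSubprocess = planAudit(true, {});
console.log(insideSession.tiers, bashSubprocess.warning);
```

The two simulated scopes are the same split that hid the bug: a smoke test that only ever runs with the first scope never exercises the second path.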

I missed it in the smoke test for an embarrassing reason. The smoke test ran from inside a Claude Code session, where Task is available. Task calls succeeded, output looked clean, I shipped. The production user flow (clone repo, open Claude Code, run /wcag:audit) hits the Bash subprocess path, where Task isn’t available. Different code path, different result.

The fix took 45 minutes and one file. Refactored the skill to dispatch Task calls directly inside the Claude Code session (5 parallel calls), reserving the CLI subprocess for static plus dynamic only. Pattern: the skill is the orchestration document, the CLI is the deterministic engine, don’t mix layers. v0.1 of the toolkit’s earlier wcag-static-analyze skill got this right. v0.3 regressed. Now it doesn’t. One file changed, 134 insertions, 37 deletions, no code change. Verified on a react-basic fixture: 19 findings, matches the Pro alpha.2 baseline of 19. Parity confirmed.

Public toolkit shipped Friday. Bug discovered Sunday. Fixed Monday. One file changed. That’s why dogfooding before publication matters.

Smoke test in Claude Code is not the real user flow through a skill. Caught it. Fixed it. The numbers above are from the fixed skill.

Tomorrow

Final piece tomorrow. F to A in 8 commits, 75 minutes total Claude Code work. One token edit, four lines, six findings resolved, in detail. Pro tier on the same project: multi-runtime (Claude Code, OpenCode subprocess, Ollama local for sensitive client repos), auto-fix engine with deterministic patchers (image-alt, html-lang), niche specialists landing in alpha.4 (modal-specialist, ecommerce-journey). Live auto-fix demo with before-and-after grade. Honest commercial framing: public is education, Pro is the niche, you can rebuild it or you can hire me. Series tease: next week, jarvis-brain.

If you’re shipping accessibility work in 2026 and using AI in your stack, Part 3 is for you. #FromTheField.