Scale Beyond the Distillate: F to A in 8 Commits, Plus What Pro Tier Actually Adds

Three days. Three tool versions. Same portfolio. F (0/100) to A (100/100) in 8 commits, 75 minutes of Claude Code work, 16 unique WCAG findings caught and resolved.

Day 1 ran V0.2 public, the boring tier. Static TypeScript analyzer plus Playwright with axe-core. Three real findings on portfolio.sdet.it after stripping the noise floor. CI-grade work, deterministic, no AI in the loop.

Day 2 ran V0.3 public, the same backbone plus 5 AI specialists reading source. Two productive audit runs found 16 unique findings between them. Round 3 found zero new findings. Convergence, not regression.

Today, Day 3, runs V0.4 Pro on the same project. Same prompts. Same portfolio. Same WCAG 2.2 AA rules. What changes: the runtime, the auto-fix layer, the specialist roster.

Today: what Pro tier adds, and why it adds it.

The multiplicative token fix, in detail

Yesterday I mentioned a 4-line edit that resolved 6 findings. Worth zooming in.

Round 2 of yesterday’s audit returned 7 new findings. Six of them were contrast issues spread across four files: Footer.astro (twice), Topbar.astro (twice), ArticleCard.astro, MatrixToggle.astro. Different selectors. Different components. Different parts of the page. Same root cause underneath.

The diagnosis came from the AI color-contrast specialist, not from the static analyzer. Static reports 6 contrast violations and stops there. The AI specialist reads the CSS source, follows var() indirection across files, recognizes the pattern: 6 visible symptoms, 2 design tokens, 1 root cause.

The fix:

:root {
  /* dark theme default, text on #141414 */
- --color-text-subtle: #737373;   /* 3.4:1, AA fail */
- --color-text-muted:  #a1a1a1;   /* 6.5:1, AA pass but tight */
+ --color-text-subtle: #a3a3a3;   /* 6.7:1, AA pass */
+ --color-text-muted:  #b3b3b3;   /* 8.8:1, AA pass with headroom */
}

The four files containing the six symptoms were never opened: Footer, Topbar, ArticleCard, MatrixToggle. They didn’t need editing. Every component using --color-text-muted or --color-text-subtle against the dark background got the new ratio for free.
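For readers who want to check a token change by hand, here is a minimal sketch of the WCAG 2.x contrast-ratio formula. It assumes opaque six-digit hex colors on one flat background; the function names are mine, not the toolkit's, and real components may render on other backgrounds or with alpha, which changes the result.

```typescript
// WCAG 2.x contrast ratio between two opaque sRGB colors.
// Illustrative helper names; not taken from the toolkit.
type Rgb = [number, number, number];

// Parse "#rrggbb" into 0-255 channels (no shorthand or alpha in this sketch).
function parseHex(hex: string): Rgb {
  const m = /^#([0-9a-f]{6})$/i.exec(hex.trim());
  if (!m) throw new Error(`Unsupported color: ${hex}`);
  const n = parseInt(m[1], 16);
  return [(n >> 16) & 0xff, (n >> 8) & 0xff, n & 0xff];
}

// Relative luminance: linearize each sRGB channel, then apply the WCAG weights.
function relativeLuminance([r, g, b]: Rgb): number {
  const lin = (c: number): number => {
    const s = c / 255;
    return s <= 0.04045 ? s / 12.92 : Math.pow((s + 0.055) / 1.055, 2.4);
  };
  return 0.2126 * lin(r) + 0.7152 * lin(g) + 0.0722 * lin(b);
}

// Ratio of the lighter to the darker luminance, each offset by 0.05.
function contrastRatio(fg: string, bg: string): number {
  const a = relativeLuminance(parseHex(fg));
  const b = relativeLuminance(parseHex(bg));
  const lighter = Math.max(a, b);
  const darker = Math.min(a, b);
  return (lighter + 0.05) / (darker + 0.05);
}

// WCAG AA requires at least 4.5:1 for normal-size text.
function passesAA(fg: string, bg: string): boolean {
  return contrastRatio(fg, bg) >= 4.5;
}
```

Calling contrastRatio("#b3b3b3", "#141414") lands at roughly 8.8, in line with the new muted token in the diff. Because the check runs on the token value rather than on each component, one call covers every consumer of that token.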

The architectural lesson is small but important. Design-tokens-first means a single source of truth. Six findings map to two tokens, which map to one fix. The spot-fix alternative was 6 separate edits, 6 commits, six chances for drift. Token fix: 1 file (global.css), 4 lines, all 6 findings resolved.

Findability and fixability are different problems. Static rule engines find 6 contrast issues. That’s information. AI specialists find 6 issues and point to the 2 design tokens that cause them. That’s action.
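The step from "6 issues" to "2 tokens" can be sketched as a grouping pass. This is an illustration under stated assumptions, not the specialist's actual logic: the ContrastFinding shape, the selectors, and the assignment of tokens to files are all hypothetical, and the real specialist follows var() indirection across files rather than reading a pre-resolved value.

```typescript
// Hypothetical finding shape; the toolkit's real WcagFinding fields are not shown in this post.
interface ContrastFinding {
  file: string;
  selector: string;   // made-up selectors, for illustration only
  colorValue: string; // declared value, e.g. "var(--color-text-muted)"
}

// Extract the custom property name from a var() reference, or null for a literal color.
function tokenOf(value: string): string | null {
  const m = /var\(\s*(--[\w-]+)/.exec(value);
  return m ? m[1] : null;
}

// Bucket findings by the design token behind them; literals get their own bucket.
function groupByToken(findings: ContrastFinding[]): Map<string, ContrastFinding[]> {
  const groups = new Map<string, ContrastFinding[]>();
  for (const f of findings) {
    const key = tokenOf(f.colorValue) ?? `literal:${f.colorValue}`;
    const bucket = groups.get(key) ?? [];
    bucket.push(f);
    groups.set(key, bucket);
  }
  return groups;
}

// Six symptoms in four files; which file uses which token is assumed here.
const sampleFindings: ContrastFinding[] = [
  { file: "Footer.astro", selector: ".footer-note", colorValue: "var(--color-text-subtle)" },
  { file: "Footer.astro", selector: ".footer-link", colorValue: "var(--color-text-muted)" },
  { file: "Topbar.astro", selector: ".topbar-meta", colorValue: "var(--color-text-subtle)" },
  { file: "Topbar.astro", selector: ".topbar-link", colorValue: "var(--color-text-muted)" },
  { file: "ArticleCard.astro", selector: ".card-date", colorValue: "var(--color-text-muted)" },
  { file: "MatrixToggle.astro", selector: ".toggle-hint", colorValue: "var(--color-text-subtle)" },
];

const byToken = groupByToken(sampleFindings);
```

With this sample, byToken has two keys, one per token. That is the difference between a report with six rows and a report that names the two declarations to edit.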

This is also where the Pro tier story starts. AI doesn’t just find issues, it identifies root cause. V0.5 enterprise extends the same idea across repos: when the same --color-text-muted issue shows up in 5 consumer repos of a design-system organization, one atomic patch covers all 5. Federation rather than spot-fixes.

AI semantic understanding times design-tokens-first architecture equals multiplicative impact. Get either right, you save time. Get both right, you get this.

Pro tier walk-through, V0.4 alpha.3

Pro V0.4 alpha.3 ships today on the same project. Here’s what it adds over public V0.3.

Multi-runtime. The public toolkit assumes Claude Code in-session. That’s the optimal path for solo developers running audits during code review. Pro adds two more runtimes. OpenCode subprocess works without a Claude Code session and is model-agnostic, so the same audit pipeline runs against GPT-5, Claude, or any model OpenCode supports. Ollama local runtime sends nothing out: the model runs on the auditor’s machine, source code never leaves the laptop, every prompt and response stays local.

The Ollama path matters for one reason: client compliance. Audit work for regulated industries (financial services, healthcare, public-sector contractors) cannot send client source through hosted LLMs. Ollama local runtime is the absolute minimum requirement to do that work at all. Public toolkit can’t do this. Pro can, by design.

The pipeline stays runtime-agnostic. Same WcagFinding shape, same dedupe step, same A through F grade. The runtime is a swappable backbone, not a feature.
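One way to picture the "swappable backbone" is a pipeline that depends only on an interface. The sketch below is an assumption-heavy illustration: the WcagFinding fields, the AuditRuntime interface, and the dedupe key are invented for the example, since the post does not show the real ones.

```typescript
// All names and fields below are illustrative; the real shapes are not shown in this post.
interface WcagFinding {
  rule: string;     // e.g. "color-contrast"
  file: string;
  selector: string;
}

// A runtime only has to turn a prompt into raw findings.
interface AuditRuntime {
  readonly name: string;
  audit(prompt: string): Promise<WcagFinding[]>;
}

// Collapse findings that point at the same rule, file and selector (first one wins).
function dedupe(findings: WcagFinding[]): WcagFinding[] {
  const seen = new Map<string, WcagFinding>();
  for (const f of findings) {
    const key = `${f.rule}|${f.file}|${f.selector}`;
    if (!seen.has(key)) seen.set(key, f);
  }
  return Array.from(seen.values());
}

// The pipeline knows the interface, never the concrete runtime behind it.
async function runAudit(runtime: AuditRuntime, prompt: string): Promise<WcagFinding[]> {
  return dedupe(await runtime.audit(prompt));
}

// Stand-in runtime; a real one would wrap Claude Code, OpenCode or Ollama.
const stubRuntime: AuditRuntime = {
  name: "stub",
  audit: async () => [
    { rule: "color-contrast", file: "Footer.astro", selector: ".footer-note" },
    { rule: "color-contrast", file: "Footer.astro", selector: ".footer-note" },
    { rule: "image-alt", file: "ArticleCard.astro", selector: "img" },
  ],
};
```

Swapping runtimes then means passing a different object to runAudit; the dedupe step and everything after it stay untouched.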

Auto-fix engine. Two deterministic patchers ship in alpha.3: ImageAltPatcher for missing alt attributes on <img> elements, HtmlLangPatcher for missing lang on <html>. Both produce predictable output and write atomic commits per fix. On portfolio.sdet.it, the auto-fix engine handles 1 of 22 findings from the original audit, which is 4.5% coverage.

That number is the honest version of the story. Auto-fix engine handles roughly 5%, the mechanical patches like missing alt or missing html-lang. The other 95% are author and designer decisions. aria-label content needs human judgment. Color tokens need design-system buy-in. Heading restructure needs editorial decisions about content. AI specialists discover these. Humans fix them. That’s the public/Pro split.

Auto-fix saves the boring stuff. AI specialists save the architectural stuff. Pro is the integration of both.
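To show why the mechanical patches are safe to automate, here is a minimal sketch of a deterministic html-lang patch. It is not the shipped HtmlLangPatcher, whose code this post does not show; the function name, the PatchResult shape, and the "en" default are assumptions.

```typescript
// Illustrative only: a string-level stand-in for a deterministic lang patcher.
interface PatchResult {
  changed: boolean;
  output: string;
}

function patchHtmlLang(source: string, lang = "en"): PatchResult {
  // Find the first opening <html ...> tag and capture its attribute text.
  const tag = /<html\b([^>]*)>/i.exec(source);
  if (!tag) return { changed: false, output: source };

  // An existing lang attribute is an author decision; leave it alone.
  if (/\slang\s*=/i.test(tag[1])) return { changed: false, output: source };

  // Insert lang first and keep every other attribute exactly as written.
  const patched = `<html lang="${lang}"${tag[1]}>`;
  const output =
    source.slice(0, tag.index) + patched + source.slice(tag.index + tag[0].length);
  return { changed: true, output };
}
```

The same input always produces the same output, and running the patch twice changes nothing the second time. That property is what makes one atomic commit per fix reasonable, and it is exactly what aria-label wording or heading structure cannot offer.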

V0.4 alpha.4 preview, modal and ecommerce

Two specialists land in V0.4 alpha.4: modal-specialist and ecommerce-journey. Both Pro-only. Both niche. Both in the next sprint backlog.

An honest caveat before going further: the alpha.4 sprint runs 6-9 May 2026, so neither agent has shipped at the time of this write-up. Treat the rest of this section as a plan, not an inventory.

Modal-specialist. Focus trap timing (when the trap engages, what it traps). Focus restoration on close (which element gets focus when the dialog closes, the trigger or the body). Escape key handling. aria-modal validation, including the cookie-banner anti-pattern where aria-modal="true" declares a region modal that the page is still operable around. The decision tree between dialog and alertdialog. Scroll-lock behavior on the body element when the modal opens.

The legal context isn’t decorative. Cookie banner with aria-modal="true" is the EAA pattern that gets sites sued. European Accessibility Act, June 2025 deadline. EU e-commerce now legally exposed when assistive-tech users get told the page is fully blocked by a region that is in fact still partly operable. Generic ARIA agents flag attribute presence. modal-specialist flags the wrong choice.
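The gap between "attribute present" and "attribute correct" can be sketched as a check over observed behavior. Since modal-specialist has not shipped, everything here is hypothetical: the RegionSnapshot shape, its field names, and the verdict labels are mine.

```typescript
// Hypothetical snapshot of a region, as a runtime probe might report it.
interface RegionSnapshot {
  role: "dialog" | "alertdialog" | "region";
  ariaModal: boolean;       // aria-modal="true" is declared
  backgroundInert: boolean; // the rest of the page cannot be reached or operated
  focusTrapped: boolean;    // Tab cycles inside the region only
}

type ModalVerdict = "not-modal" | "ok" | "false-modal-claim";

// aria-modal is a promise about behavior; verify the behavior, not the attribute.
function checkAriaModal(r: RegionSnapshot): ModalVerdict {
  if (!r.ariaModal) return "not-modal";
  return r.backgroundInert && r.focusTrapped ? "ok" : "false-modal-claim";
}

// The cookie-banner anti-pattern: declared modal, page still operable around it.
const cookieBanner: RegionSnapshot = {
  role: "dialog",
  ariaModal: true,
  backgroundInert: false,
  focusTrapped: false,
};
```

A presence-only scan would pass cookieBanner because the attribute is there. The behavioral check returns "false-modal-claim", which is the finding that matters to an assistive-tech user.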

Ecommerce-journey. Variant change announcements through aria-live regions when the user picks size M then size L (price update, availability update, both spoken). Payment review step (WCAG 3.3.4 plus 3.3.6, financial-transaction-specific error prevention). Color-only stock indicators (green dot without text, red border without label). Cart toast aria-live politeness, polite vs assertive depending on context. Filter facet count updates announced after each filter toggle.
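The polite-versus-assertive choice can be sketched as a mapping from shop events to announcements. This is an illustration of the idea, not the planned agent: the event kinds, message wording, and the rule "routine updates polite, blocking errors assertive" are my assumptions.

```typescript
// Hypothetical shop events an audit might expect to be announced.
type ShopEvent =
  | { kind: "item-added"; name: string }
  | { kind: "variant-changed"; price: string; inStock: boolean }
  | { kind: "payment-error"; message: string };

interface Announcement {
  politeness: "polite" | "assertive"; // value for the aria-live attribute
  text: string;
}

// Routine updates wait their turn; errors that block checkout interrupt.
function announce(e: ShopEvent): Announcement {
  switch (e.kind) {
    case "item-added":
      return { politeness: "polite", text: `${e.name} added to cart` };
    case "variant-changed":
      // Price and availability are spoken together, not color-coded only.
      return {
        politeness: "polite",
        text: `Price ${e.price}, ${e.inStock ? "in stock" : "out of stock"}`,
      };
    case "payment-error":
      return { politeness: "assertive", text: e.message };
  }
}
```

Switching from size M to size L would produce one polite announcement carrying both the price and the availability, which is the behavior the variant-change check looks for.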

Modal-specialist isn’t one more agent. It’s 8 years of e-commerce audits in one prompt. You can rebuild it. It’ll take you the years I spent auditing 50+ e-commerce sites. Or you hire me.

The split is sharper than it looks. A generic keyboard agent flags a missing onKeyDown. modal-specialist checks which specific element receives focus when the dialog closes. A generic forms agent checks labels. ecommerce-journey checks the 3.3.4 review step on payment forms specifically. The public toolkit gives you 5 specialists. Pro adds 2 you can’t write yourself unless you’ve audited e-commerce for years.

V0.5 enterprise, jarvis-brain and cross-repo

V0.5 enterprise tier (planned, post alpha.4): jarvis-brain integration. Different problem class.

The story is a scope shift, not a feature shift. Solo developer with one repo, one audit, one fix fits comfortably in V0.4 Pro. Design-system organization with 10 repos consuming shared tokens does not. The multiplicative token fix from earlier in this write-up, the one that resolved 6 findings on a single repo, scales differently when the same --color-text-muted issue exists across 5 consumer repos. Five spot-fixes, five PRs, five chances for drift. Or one atomic patch coordinated across the design system.

jarvis-brain is the multi-tenant knowledge vault that makes that coordination possible. Token federation across repos, DRY violation detection at the design-system layer, cross-repo deep dedupe (same defect in semantically different file paths collapses to one finding, not five).

V0.5 backlog also picks up three audit modes that the current public single-mode audit does not have. --scope component runs a Storybook-style audit on one component. --scope page runs a route-level audit, following imports two levels deep. --scope full runs the whole project, the default. Different audit unit, different output, different price point.
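"Following imports two levels deep" reads like a depth-limited walk over the import graph. The sketch below shows one way that could work. It is a guess at the shape of a planned feature: the ImportGraph type, the collectScope name, and the sample file graph are hypothetical.

```typescript
// Hypothetical import graph: module path -> modules it imports directly.
type ImportGraph = Record<string, string[]>;

// Breadth-first walk from the route entry, stopping maxDepth imports away.
function collectScope(graph: ImportGraph, entry: string, maxDepth = 2): string[] {
  const depthOf = new Map<string, number>([[entry, 0]]);
  const queue: string[] = [entry];
  while (queue.length > 0) {
    const current = queue.shift() as string;
    const depth = depthOf.get(current) as number;
    if (depth >= maxDepth) continue; // do not expand beyond the limit
    for (const dep of graph[current] ?? []) {
      if (!depthOf.has(dep)) {
        depthOf.set(dep, depth + 1);
        queue.push(dep);
      }
    }
  }
  return Array.from(depthOf.keys());
}

// Made-up graph for a /projects route.
const sampleGraph: ImportGraph = {
  "projects.astro": ["Layout.astro", "ArticleCard.astro"],
  "Layout.astro": ["Topbar.astro"],
  "Topbar.astro": ["Icon.astro"],
};

const pageScope = collectScope(sampleGraph, "projects.astro");
```

With the sample graph, pageScope contains the route, Layout, ArticleCard and Topbar, while Icon sits three levels down and stays out. A component-level scope would be the same walk with a smaller starting point.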

Plus a positive findings section in the report. What the audit confirmed working, not just what’s broken. Trust signal for design-system maintainers who need to demonstrate progress, not just remaining work.

Three tiers. Three problems. Three pricing levels. Aligned by audit scope, not arbitrary feature gating.

Tier comparison

Visual reference for the tier model:

graph LR
    subgraph Public["Public V0.3 (AGPL-3.0)"]
        A1[Static TS]
        A2[Dynamic Playwright]
        A3[5 AI Specialists]
        A4[Lead Orchestrator]
        A5[A-F Grading]
        A6["/wcag:audit skill"]
    end

    subgraph Pro["Pro V0.4 alpha.3 (Commercial)"]
        B1[Everything from Public]
        B2[Multi-runtime CC/OpenCode/Ollama]
        B3[Auto-fix Engine]
        B4[wcag.config.ts]
    end

    subgraph Pro4["Pro V0.4 alpha.4 (Next sprint)"]
        C1[+modal-specialist]
        C2[+ecommerce-journey]
    end

    subgraph Enterprise["Pro V0.5 Enterprise (Planned)"]
        D1[Cross-repo via jarvis-brain]
        D2[Design System Federation]
        D3[Three audit modes]
        D4[Deep dedupe semantic]
        D5[Positive findings]
    end

    Public -.imports.-> Pro
    Pro --> Pro4
    Pro4 --> Enterprise

Pro imports from Public. That’s the architectural relationship, not a marketing tagline. Each upper tier wraps the lower one. The 5 specialists in Pro are the 5 from Public. The audit pipeline in Enterprise is the same one from Pro. Public is the foundation, not a crippled demo.

Read left to right: Public is what runs in CI on every commit, Pro is what runs daily on a maintained project, Enterprise is what runs across an organization’s repos. Different audit unit at each tier. Pricing aligned to that unit, not to feature gating.

Honest commercial framing

Last word on the public/Pro split. I want to be clear about what you’re paying for.

This isn’t “we hide features for money.” It’s “we charge for niche expertise you can’t write yourself unless you’ve audited e-commerce for years.” Public is education. Pro is the niche.

Three things you buy with a Pro license. First, niche expertise: modal-specialist and ecommerce-journey encode patterns from years of audit work, not generic ARIA scans. Second, multi-runtime: Ollama local for sensitive client repos, no tokens out, the absolute minimum compliance bar for regulated industries. Third, maintenance: the rules evolve as WCAG moves from 2.2 toward 3.0 (still a W3C Working Draft), prompts get patched against new anti-patterns, the toolkit stays current without you tracking the spec.

Three things you do not buy. The architecture (clone the public toolkit, learn from it, rebuild it). The 5 baseline specialists (those are public, AGPL-3.0, free). Magic (the audit still requires human review of findings, AI discovers, you decide).

You can rebuild the niche specialists. It’ll take you the years I spent auditing 50+ e-commerce sites. Or you hire me.

Public toolkit on GitHub: github.com/darco81/sdet-wcag-toolkit (AGPL-3.0). Pro tier at sdet.it/services. But before #01 wraps, one more thing.

Coming in #05: the audit that found 5,816 findings

While writing this, I ran V0.4’s new multi-page audit on the same portfolio. Same project, different scope. Round 4 router-scan: eleven routes, nine findings on three pages the homepage audit never touched. Token-level fix on /projects: one ruleset edit, six findings down. Same multiplicative pattern from Day 2, scaled across the site.

Then sitemap audit. Thirty-five routes, full published surface including articles and episode pages. Five thousand, eight hundred sixteen findings.

Four CSS-level commits later: seven. All seven are runtime false positives from a toolkit subsystem I wrote myself. Zero SERIOUS, zero AA failures. 5,816 → 7. Three orders of magnitude. Four commits. No JavaScript.

That’s #05’s story, June 2-4. A multi-page audit shows you the dependency graph of bugs, not just a URL list. Single fix, multiplicative cleanup. The difference between an audit tool and a Lighthouse extension.

Series wrap, plus what’s next

From the field #01 wraps. Three days, one project, three tool versions, real numbers throughout.

Day 1: V0.2 baseline, 3 real findings caught by static plus dynamic. Day 2: V0.3 plus AI, 16 unique findings across triangulation runs, Round 3 finding zero new (convergence proof). Day 3: F to A in 8 commits, 75 minutes total Claude Code work, plus the Pro tier walk-through.

Series continuity tease. Week 2: jarvis-brain, the system that stops Claude Code from burning tokens on Grep and Glob by precomputing a semantic map served via MCP. Week 3: context-first QA workflow, the platform behind my daily Jira and Tempo automation. Week 4: performance audit 5-agent pipeline. Week 5: V0.4 multi-runtime build process.

If you’re building accessibility tools, design systems, or AI-driven dev workflows, follow #FromTheField. Real production, real numbers, real engineering humility. Next week: jarvis-brain.

Related

Triangulation: AI Specialists Across Three Audit Runs

When axe-core Isn't Enough: Auditing My Own Portfolio with V0.2 Public

Multi-page WCAG, Part 1: the machine behind 5,816 to 7