Context-First QA, Part 2: The 10 Maps

Yesterday I showed the math. 1,000 tasks, $700 vs $40, 70% never hit an LLM.

Today I show the map.

Ten architectural layers that need to be deterministic before AI can do anything useful. Four of them I’m publishing as code drops over the next 8 weeks. The other six stay in pitch-mode and I’ll tell you why.

Recap: the sandwich

If you missed Part 1: deterministic input, AI middle, deterministic output. The LLM lives in a small bounded slot in the middle. Code is the bun, model is the patty. Everything else is condiments.

But “deterministic in, AI middle, deterministic out” is only the spine. Now we put meat on the bones.

Three-Layer Architecture

Every piece of the pipeline follows the same internal pattern.

graph TB
    subgraph Raw["Layer 1: Raw (deterministic capture)"]
        J[Jira API raw]
        F[Figma MCP raw]
        P[Playwright snap raw]
    end

    subgraph Enriched["Layer 2: Enriched (deterministic ETL)"]
        TE[Token extraction]
        AE[ARIA + computed CSS]
        CE[Context bundle]
    end

    subgraph Builders["Layer 3: Builders (structured output)"]
        TC[Task context object]
        BP[Baseline + current pair]
        QC[QA prompt context]
    end

    Raw --> Enriched --> Builders --> AI[AI Decision]
    AI --> O[Validated output]

Three internal layers, repeated across every component:

Raw - the wild west. Jira’s HTML, Figma’s design-tokens JSON blob, Playwright’s full DOM dump. Source-of-truth fetch, nothing more.
Enriched - deterministic ETL. Parse the ADF. Walk the Figma tree. Compute styles. Still no LLM, still no interpretation - just code that knows the shape of each source.
Builders - typed objects. Whatever passes from here forward has a schema, a contract, a test. The LLM, when it eventually shows up, sees only Builder output.
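To make the three layers concrete, here is a minimal sketch for a single source. It is an illustration under assumptions, not the production code: the RawIssue shape is a heavily trimmed, hypothetical stand-in for a fetched ticket, and the names enrichIssue, buildTaskContext, and TaskContext are invented for this example.

```typescript
// Layer 1: raw capture. Hypothetical, heavily trimmed shape of a fetched issue.
interface RawIssue {
  key: string;
  fields: {
    summary: string;
    issuetype: { name: string };
    priority: { name: string } | null;
  };
}

// Layer 2: enriched. Normalised by plain code, still loosely typed.
interface EnrichedIssue {
  id: string;
  title: string;
  type: string;
  severity: string | null;
}

// Layer 3: builder output. Closed unions, so unknown values cannot slip through.
type IssueType = "bug" | "story" | "task";
type Severity = "low" | "medium" | "high";

interface TaskContext {
  readonly id: string;
  readonly title: string;
  readonly type: IssueType;
  readonly severity: Severity;
}

const ISSUE_TYPES: readonly IssueType[] = ["bug", "story", "task"];
const SEVERITIES: readonly Severity[] = ["low", "medium", "high"];

function enrichIssue(raw: RawIssue): EnrichedIssue {
  return {
    id: raw.key,
    title: raw.fields.summary.trim(),
    type: raw.fields.issuetype.name.toLowerCase(),
    severity: raw.fields.priority ? raw.fields.priority.name.toLowerCase() : null,
  };
}

function buildTaskContext(e: EnrichedIssue): TaskContext {
  const type = ISSUE_TYPES.find((t) => t === e.type);
  const severity = SEVERITIES.find((s) => s === e.severity);
  // Fail loudly instead of guessing: a bad value stops here, not inside a prompt.
  if (type === undefined) throw new Error(`Unknown issue type: ${e.type}`);
  if (severity === undefined) throw new Error(`Unknown severity: ${String(e.severity)}`);
  return { id: e.id, title: e.title, type, severity };
}
```

The builder is the only place where loose strings become closed unions, which is what gives the later stages something to test against.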

This is the same pattern that powers my WCAG toolkit (Series #01, May 5-7). Same pattern shows up at scale in multi-page audits (Series #05, coming June 2-4). Same pattern in Figma-to-code (#07, June 16-18). It’s not novel. It’s just disciplined.

The 10 maps

Ten themes. Each one is a layer that needs to be deterministic before the AI gets involved. Here’s the full map.

A: Input Layer - capturing the wild west deterministically → Episode #11, July 14-16

Jira API + Figma MCP + Playwright snap. Three messy sources. One typed QAContext object. AI never reads raw HTML.

B: Decision Layer - 70% without an LLM → Episode #10, coming ⭐

The decision gate from Part 1. CSS diff + pixel diff + a11y regression check = deterministic verdict. 70% of tasks exit here.
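A rough sketch of what such a gate could look like, assuming the three checks have already been reduced to numbers. The GateSignals fields, the threshold values, and the "needs_llm" verdict name are all hypothetical placeholders for illustration, not figures from the real pipeline.

```typescript
type Verdict = "pass" | "fail" | "needs_llm";

interface GateSignals {
  cssDiffCount: number;    // computed-style properties that differ from baseline
  pixelDiffRatio: number;  // fraction of changed pixels, 0..1
  a11yRegressions: number; // new accessibility violations vs. baseline
}

interface GateThresholds {
  pixelNoise: number; // at or below this ratio, pixel changes count as rendering noise
  pixelFail: number;  // at or above this ratio, the change is an obvious failure
}

// Hypothetical defaults, chosen only so the example runs.
const DEFAULT_THRESHOLDS: GateThresholds = { pixelNoise: 0.001, pixelFail: 0.05 };

function decide(s: GateSignals, t: GateThresholds = DEFAULT_THRESHOLDS): Verdict {
  // Any accessibility regression is a hard fail, no model needed.
  if (s.a11yRegressions > 0) return "fail";
  // A large visual change is also a hard fail.
  if (s.pixelDiffRatio >= t.pixelFail) return "fail";
  // Nothing changed structurally and pixels are within noise: pass.
  if (s.cssDiffCount === 0 && s.pixelDiffRatio <= t.pixelNoise) return "pass";
  // Everything in between is ambiguous and goes to the bounded LLM slot.
  return "needs_llm";
}
```

Only the last branch costs tokens. The first three are pure functions of measured values, so they can be unit-tested like any other code.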

C: Output Layer - Atlassian ADF without tears → Episode #09, July 8-9 ⭐ TOP FLAGSHIP

Jira ADF inline images via a 303 redirect. The trick that solves a famous Jira pain. Plus markdown-to-ADF bridges and multi-stage uploads.

D: Orchestration - many agents, one story → August/September

Partial continuations, heartbeats, retry semantics. Multi-agent dispatch where every agent knows its scope and its boundary.

E: HITL Safety - every WRITE asks permission → Episode #12, July 21-23 ⭐

Action queue state machine. Atomic approve-and-execute with partial-failure tracking. The dedup cautionary tale where my UNIQUE constraint was wrong.
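As an illustration of the idea only, an action queue state machine can be as small as a transition table plus a function that settles the final state from per-step results. The state names and the settle helper below are assumptions made for this sketch. The atomic approve-and-execute part depends on the storage layer and is not shown.

```typescript
type ActionState =
  | "pending"
  | "approved"
  | "rejected"
  | "executing"
  | "done"
  | "partially_failed"
  | "failed";

// Allowed transitions. Terminal states map to an empty list.
const TRANSITIONS: Record<ActionState, readonly ActionState[]> = {
  pending: ["approved", "rejected"],
  approved: ["executing"],
  rejected: [],
  executing: ["done", "partially_failed", "failed"],
  done: [],
  partially_failed: [],
  failed: [],
};

function transition(from: ActionState, to: ActionState): ActionState {
  if (TRANSITIONS[from].indexOf(to) === -1) {
    throw new Error(`Illegal transition: ${from} -> ${to}`);
  }
  return to;
}

// Derive the terminal state of an executed action from its per-step results.
function settle(stepResults: readonly boolean[]): ActionState {
  if (stepResults.length === 0) throw new Error("An action needs at least one step");
  const succeeded = stepResults.filter((ok) => ok).length;
  if (succeeded === stepResults.length) return "done";
  return succeeded === 0 ? "failed" : "partially_failed";
}
```

Note that nothing leaves "pending" except through "approved" or "rejected", so a write cannot reach "executing" without a human decision on record.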

F: Vendor-Agnostic Infra - swap provider, not orchestrator → September

Extracting vendor-agnostic QA infra from a Claude-bound dispatcher. The pattern that makes a multi-LLM future cheap.

G: Cost & Telemetry - cents, not dollars → Q4 2026

Per-task cost attribution. Token telemetry per worker. When to cache, when to batch, when to fail fast.

H: Operational Discipline - calendars, dedup, UNIQUE was wrong → Q4 2026

Five gotchas I hit in production. UNIQUE on (task, run) seemed obvious - until partial retries broke it.

I: Multi-Process Glue - terminals, tmux, CI → Q4 2026

One script, three backends. Same shell across Claude Code, OpenCode subprocess, direct API.

J: Approval UX - HITL UI that doesn’t feel like work → Episode #13, August 4-6

Drag-drop approval queues. Countdowns. Editable QA reports. UX I built so I’d actually use my own agent.

Why ten

Why not three? Why not twenty?

Because each of these layers is independently swappable, is either deterministic or has a bounded LLM scope (no creep), and has its own test surface. Deterministic layers get unit tests. LLM layers get contract tests against schemas. Both kinds can fail loudly, and both can be debugged in isolation.
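A contract test against a schema can be sketched without any library: parse whatever the model returned and reject anything that does not match the expected shape. The QaVerdict shape and the parseVerdict name here are hypothetical. A real setup might use a schema validator instead of hand-written checks.

```typescript
interface QaVerdict {
  verdict: "pass" | "fail";
  reasons: string[];
}

// Turn raw model text into a typed verdict, or throw with a specific reason.
function parseVerdict(rawModelOutput: string): QaVerdict {
  let data: unknown;
  try {
    data = JSON.parse(rawModelOutput);
  } catch (e) {
    throw new Error("Model output is not valid JSON");
  }
  if (typeof data !== "object" || data === null || Array.isArray(data)) {
    throw new Error("Model output is not a JSON object");
  }
  const obj = data as Record<string, unknown>;
  const verdict = obj["verdict"];
  const reasons = obj["reasons"];
  if (verdict !== "pass" && verdict !== "fail") {
    throw new Error("Field 'verdict' must be 'pass' or 'fail'");
  }
  if (!Array.isArray(reasons) || !reasons.every((r) => typeof r === "string")) {
    throw new Error("Field 'reasons' must be an array of strings");
  }
  return { verdict, reasons: reasons as string[] };
}
```

The contract test is then a handful of recorded model outputs, good and bad, run through this function. A model failure shows up as a thrown error with a named field instead of a silently wrong report.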

The anti-pattern this avoids: the “agent does everything” trap. When one component has no boundary, it has no testable failure mode. When ten components each own one thing, you can fix what’s broken without rewriting what works.

The input layer up close

Layer A is the most interesting to start with. It’s where everyone makes the first mistake.

The mistake: feeding raw Jira HTML to an LLM. The model reads the markup, hallucinates the field structure, misses an attachment, and produces a verdict based on half the ticket. You wouldn’t know until production.

The fix: a single-call task aggregator that returns a typed QAContext object. Here’s the shape.

// Single-call task aggregator - one webhook, full context
// This is what AI eventually sees. Pre-validated, structured, typed.

async function assembleContext(taskId: string): Promise<QAContext> {
  // Parallel deterministic fetches
  const [jiraRaw, figmaRaw, snapRaw] = await Promise.all([
    fetchJiraTask(taskId),         // n8n webhook, full task + comments + attachments
    fetchFigmaBaseline(taskId),    // MCP call, design tokens + components
    capturePlaywrightSnap(taskId), // headless browser, aria + computed CSS
  ]);

  // Layer 2: enrichment (still deterministic - no LLM)
  const enriched = {
    jira: parseJiraADF(jiraRaw),
    figma: extractTokens(figmaRaw),
    snap: { aria: snapRaw.tree, css: snapRaw.computedStyles },
  };

  // Layer 3: builder pattern - typed output AI can trust
  return {
    taskMeta: { id: taskId, type: enriched.jira.type, severity: enriched.jira.severity },
    baseline: enriched.figma,
    current: enriched.snap,
    diff: computeStructuralDiff(enriched.figma, enriched.snap),
  };
}

Three parallel fetches. Each returns raw data. Each gets enriched separately. Then a builder produces a typed QAContext - that’s what the LLM eventually sees.
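For a sense of what the typed object and the diff step might look like, here is a simplified sketch. It assumes both the baseline and the current snapshot have already been flattened to string maps. That is a simplification for illustration: the aggregator above passes the whole snap object to computeStructuralDiff, and the real field names may differ. StyleMap and StructuralDiff are hypothetical names.

```typescript
// Flat "property path -> value" map, e.g. "button.primary.color" -> "#0055ff".
type StyleMap = Record<string, string>;

interface StructuralDiff {
  changed: { key: string; expected: string; actual: string }[];
  missing: string[];    // present in the baseline, absent from the current snapshot
  unexpected: string[]; // present in the current snapshot, absent from the baseline
}

// Hypothetical shape of the object the model eventually sees.
interface QAContext {
  taskMeta: { id: string; type: string; severity: string };
  baseline: StyleMap;
  current: { aria: unknown; css: StyleMap };
  diff: StructuralDiff;
}

function computeStructuralDiff(baseline: StyleMap, current: StyleMap): StructuralDiff {
  const diff: StructuralDiff = { changed: [], missing: [], unexpected: [] };
  // Sorted keys keep the output stable across runs, which keeps tests stable too.
  for (const key of Object.keys(baseline).sort()) {
    const expected = baseline[key];
    const actual = current[key];
    if (actual === undefined) {
      diff.missing.push(key);
    } else if (expected !== undefined && actual !== expected) {
      diff.changed.push({ key, expected, actual });
    }
  }
  for (const key of Object.keys(current).sort()) {
    if (baseline[key] === undefined) diff.unexpected.push(key);
  }
  return diff;
}
```

Because the diff is computed before any model call, an empty diff is already a strong signal that the task can exit at the decision gate.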

Token-wise: the naive version sends ~8K tokens to the model (raw Jira HTML alone is huge). The structured version sends ~2K. That’s roughly 4x fewer input tokens, so roughly 4x cheaper on input, just from the input layer, before any decision-gate routing.

And the AI literally cannot read raw HTML in this architecture. It only sees QAContext. If the parser breaks, the build breaks. If the model misinterprets a typed field, that’s a model failure with a logged input, not a parsing hallucination.

What’s next

Tomorrow: the calendar.

Four of these ten layers (B, C, A, E) publish as full code drops over the next 8 weeks. The other six stay in pitch-mode for now - they’re production-specific, multi-tenant, compliance-sensitive, or simply not the right scope for a public distillate.

I’ll show you exactly which week each one lands, what the repo branch looks like, and how to engage if you want this pattern in your stack.

The map is upside-down for most teams. Let me show you mine.
