1000 QA tasks. Naive approach: every one through an LLM. Cost: a small mortgage.

My approach: most tickets never see an LLM. Cost: cents.

Same accuracy. Same audit trail. Better sleep.

And in four weeks, the bill stops hiding.


The wrong default

Every AI-in-QA post I scroll past says some version of the same thing. Let AI write your tests. Let AI generate scenarios. Let AI explore your app. Let AI judge whether things look right.

It demos beautifully. It crashes in production.

The pattern: an agent reads raw Jira HTML, parses it, guesses element selectors, hallucinates the user’s intent, writes a test, runs it. Half the time the test passes for the wrong reason. The other half it fails because Playwright timed out on a popup the LLM didn’t know existed.

I built this stack three months ago. It worked in demos. It failed on real tickets.

Then I inverted the architecture. The map was upside-down.

Context before LLM

AI is the second step, not the first.

The first step is a deterministic engineering pipeline that hands AI clean, structured, verified context. The LLM never sees raw Jira. It never parses HTML. It never guesses CSS selectors. By the time the model gets a token, every input has been fetched, normalized, typed, and validated.
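To make "fetched, normalized, typed, and validated" concrete, here is a minimal sketch of one normalization step. Everything in it is an assumption for illustration: the `RawTicket` and `TicketContext` shapes, the field names, the key pattern, and the crude tag stripper are not the real Jira payload or my production code.

```typescript
// Hypothetical shapes: field names are illustrative, not the real Jira payload.
interface RawTicket {
  key?: unknown;
  summary?: unknown;
  descriptionHtml?: unknown;
  figmaUrl?: unknown;
}

interface TicketContext {
  key: string;
  summary: string;
  description: string; // plain text, markup removed
  figmaUrl: string | null;
}

type Result<T> = { ok: true; value: T } | { ok: false; errors: string[] };

// Crude tag stripper for illustration; a real pipeline would use an HTML parser.
function stripTags(html: string): string {
  return html.replace(/<[^>]*>/g, " ").replace(/\s+/g, " ").trim();
}

// Pure function: same raw input always yields the same context or the same errors.
function toTicketContext(raw: RawTicket): Result<TicketContext> {
  const errors: string[] = [];

  const key = typeof raw.key === "string" ? raw.key.trim() : "";
  if (!/^[A-Z][A-Z0-9]*-\d+$/.test(key)) errors.push("key: expected PROJECT-123 form");

  const summary = typeof raw.summary === "string" ? raw.summary.trim() : "";
  if (summary === "") errors.push("summary: missing");

  const description =
    typeof raw.descriptionHtml === "string" ? stripTags(raw.descriptionHtml) : "";

  let figmaUrl: string | null = null;
  if (typeof raw.figmaUrl === "string" && raw.figmaUrl !== "") {
    if (raw.figmaUrl.startsWith("https://")) figmaUrl = raw.figmaUrl;
    else errors.push("figmaUrl: expected https URL");
  }

  if (errors.length > 0) return { ok: false, errors };
  return { ok: true, value: { key, summary, description, figmaUrl } };
}
```

A ticket that fails validation never reaches the model. It goes back with a list of concrete errors, which is cheaper and easier to debug than a confused verdict.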

Think of it this way: AI as a senior developer doing code review on a well-prepared PR vs. AI as an intern told to “figure out the Jira.” Same model. Opposite results.

The principle: deterministic harness around the LLM, not the other way around. Code that yields the same output for the same input. Always. No probability. No vibes. The LLM is small, contained, and verifiable - one component in a system, not the system itself.

It’s a sandwich.

The sandwich and the math

Three layers:

  1. Deterministic input - parallel fetches from Jira, Figma, Playwright. Parsing, enrichment, typing. All pure code.
  2. AI decision - small, bounded scope. Model sees structured JSON, returns one of N validated verdicts.
  3. Deterministic output - typed object goes to a publisher that writes to Jira, attaches images, files an audit record.
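The middle layer is the one worth sketching, because "one of N validated verdicts" only holds if code enforces it. The example below is a rough illustration under my own assumptions: the verdict names, the `JudgmentContext` fields, and the fallback to `NEEDS_HUMAN` are hypothetical, and the model call is injected as a plain function so the harness can be tested without a network.

```typescript
// All names here are hypothetical; the model call is injected as a function.
const VERDICTS = ["PASS", "FAIL", "NEEDS_HUMAN"] as const;
type Verdict = (typeof VERDICTS)[number];

interface JudgmentContext {
  ticketKey: string;
  changedSelectors: string[];
  diffRatio: number;
}

interface Judgment {
  ticketKey: string;
  verdict: Verdict;
  reason: string;
}

type Judge = (contextJson: string) => Promise<string>;

// Layer 2 boundary: any reply that is not one of the N verdicts is rejected
// and routed to a human instead of being trusted.
function parseVerdict(rawReply: string): { verdict: Verdict; reason: string } {
  let parsed: unknown;
  try {
    parsed = JSON.parse(rawReply);
  } catch {
    return { verdict: "NEEDS_HUMAN", reason: "reply-not-json" };
  }
  if (typeof parsed !== "object" || parsed === null) {
    return { verdict: "NEEDS_HUMAN", reason: "reply-not-object" };
  }
  const { verdict, reason } = parsed as { verdict?: unknown; reason?: unknown };
  const match = VERDICTS.find((v) => v === verdict);
  if (match === undefined) {
    return { verdict: "NEEDS_HUMAN", reason: "verdict-out-of-range" };
  }
  return { verdict: match, reason: typeof reason === "string" ? reason : "no-reason-given" };
}

// The model only ever sees serialized, already-validated context.
async function judge(ctx: JudgmentContext, callModel: Judge): Promise<Judgment> {
  const reply = await callModel(JSON.stringify(ctx));
  return { ticketKey: ctx.ticketKey, ...parseVerdict(reply) };
}
```

The publisher in layer 3 then receives a `Judgment` whose `verdict` field can only hold one of three values, whatever the model actually wrote.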

The LLM sees maybe 5% of the pipeline’s scope. The other 95% is code with tests.

Here’s what that means at scale. Take a real e-commerce QA workload: 1000 tickets a month.

Scenario: 1000 e-commerce QA tasks/month

Naive "LLM everywhere":
- Avg 8K tokens input + 2K output per task
- $0.003/1K input + $0.015/1K output (Claude Sonnet list price: $3 / $15 per million tokens)
- Per task: $0.024 + $0.030 = $0.054
- Monthly: $54 + retries + failures ~ $70-90

Deterministic-first (mine):
- ~70% tasks: zero LLM (static checks, pattern matching, CSS diff)
- ~30% tasks: LLM judgment only on prepared JSON context
- Avg 2K input + 500 output per LLM call
- Per task (LLM ones): $0.006 + $0.0075 = $0.0135
- Monthly: $0.0135 x 300 = ~$4

Same accuracy. ~13x cheaper on base cost, ~17-22x once the naive path's retries and failures are counted. Audit trail intact.

Take the table. Recalculate for your scenario. The math doesn’t lie.

The 70/30 split is a representative baseline, not a hardcoded SLA. In real production runs it moves with the workload - sometimes 60% of tasks exit without the LLM, sometimes 80%. Cosmetic regressions skew it toward more deterministic exits; ambiguous UX judgment skews it toward more LLM calls. Either way: the deterministic floor catches first, the LLM only sees what’s left.
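If you want to recalculate for your own scenario, the arithmetic fits in a few lines. This is a sketch, not a billing tool. The `Workload` and `Rates` names are mine, and the rates below are placeholders in arbitrary units: when both setups use the same model, only the input-to-output price ratio (1:5 in the table) affects the comparison, so plug in your model's current prices to get dollars. It covers base cost only, before retries and failures.

```typescript
interface Rates {
  inputPer1K: number;  // price per 1K input tokens
  outputPer1K: number; // price per 1K output tokens
}

interface Workload {
  tasksPerMonth: number;
  llmShare: number;     // fraction of tasks that reach the LLM, 0..1
  inputTokens: number;  // average per LLM call
  outputTokens: number; // average per LLM call
}

function monthlyCost(w: Workload, r: Rates): number {
  if (w.llmShare < 0 || w.llmShare > 1) {
    throw new RangeError("llmShare must be in [0, 1]");
  }
  const perCall =
    (w.inputTokens / 1000) * r.inputPer1K + (w.outputTokens / 1000) * r.outputPer1K;
  return w.tasksPerMonth * w.llmShare * perCall;
}

// Placeholder rates: only the 1:5 input:output ratio matters for the comparison.
const rates: Rates = { inputPer1K: 1, outputPer1K: 5 };

const naive = monthlyCost(
  { tasksPerMonth: 1000, llmShare: 1, inputTokens: 8000, outputTokens: 2000 },
  rates
);

// How many times cheaper the deterministic-first setup is at a given exit rate.
const ratioAt = (deterministicExitRate: number): number =>
  naive /
  monthlyCost(
    {
      tasksPerMonth: 1000,
      llmShare: 1 - deterministicExitRate,
      inputTokens: 2000,
      outputTokens: 500,
    },
    rates
  );

// ratioAt(0.6) ≈ 10, ratioAt(0.7) ≈ 13.3, ratioAt(0.8) ≈ 20
```

So the 60-80% swing in deterministic exits moves the base saving between roughly 10x and 20x. The smaller context per call does about as much work as the exit rate does.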

Two notes if you’re on a Claude Max subscription

Right now, May 2026, Agent SDK calls and claude -p invocations share your Claude Code subscription window. So naive “LLM everywhere” doesn’t show up as a line item on your invoice - it shows up as your 5-hour window evaporating before lunch. The dollar math hides, but your daily productive hours don’t.

That changes on June 15, 2026.

Anthropic is splitting programmatic usage off into its own monthly credit, billed at full API rates. Claude Agent SDK, claude -p, Claude Code GitHub Actions, and third-party tools built on the Agent SDK all move to a separate budget at standard API pricing. Your interactive Claude Code stays on subscription. Your agentic pipelines move to pay-per-token.

So the cost table above stops being theoretical for anyone running Agent SDK-driven QA workflows. From June 15 forward, naive “LLM everywhere” is an explicit line item on your bill - and “deterministic floor first” stops being a nice-to-have. It becomes a forcing function.

Better to design for the deterministic floor now than to discover the bill in July.

The decision gate

Most QA decisions are decidable with deterministic checks - CSS diff against baseline, pixel diff with tolerance, accessibility regression detection. No interpretation needed. Pass or fail, with the reason.

Here’s the shape of the gate:

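// Deterministic decision gate - most tasks exit here
// AI sees nothing. No tokens spent.

// Assumed shapes; the diff helpers live elsewhere in the pipeline
type QAVerdict = { verdict: "PASS" | "FAIL" | "AMBIGUOUS"; reason: string; llmNeeded: boolean };
interface QATask {
  current: string; cssBaseline?: string;         // computed CSS
  snap: Uint8Array; visualBaseline?: Uint8Array; // screenshots
  aria: string; baselineAria: string;            // accessibility tree
}
declare function cssDiff(current: string, baseline: string): number;
declare function pixelDiff(snap: Uint8Array, baseline: Uint8Array): number;
declare function hasA11yRegression(aria: string, baseline: string): boolean;

function decideQAPath(task: QATask): QAVerdict {
  // Structural check (deterministic) runs first so an identical diff can't mask a regression
  if (hasA11yRegression(task.aria, task.baselineAria)) {
    return { verdict: "FAIL", reason: "a11y-regression", llmNeeded: false };
  }

  // Static deterministic checks, each compared against its own baseline
  if (task.cssBaseline && cssDiff(task.current, task.cssBaseline) === 0) {
    return { verdict: "PASS", reason: "css-identical", llmNeeded: false };
  }

  if (task.visualBaseline && pixelDiff(task.snap, task.visualBaseline) < 0.001) {
    return { verdict: "PASS", reason: "visual-identical", llmNeeded: false };
  }

  // Only here LLM enters - structured JSON context already prepared
  return { verdict: "AMBIGUOUS", reason: "needs-judgment", llmNeeded: true };
}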

Three deterministic checks. Three early exits. Token cost: zero. If none of the three fires, the task falls through to the AMBIGUOUS verdict and the LLM gets called - but only on tasks that genuinely need judgment.

The majority of tickets exit before any model wakes up. The minority hit the LLM with a typed QATask object that someone or something has already validated. Tokens go toward judgment, not parsing.
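The gate leans on helpers like `pixelDiff`, so here is one hypothetical way such a helper could work. It assumes two RGBA buffers of identical dimensions and returns the fraction of pixels that differ. The `channelTolerance` parameter and the size-mismatch rule are my assumptions for the sketch; a production setup would more likely lean on an existing image-diff library.

```typescript
// Hypothetical helper: fraction of pixels (0..1) that differ between two
// RGBA buffers. channelTolerance absorbs small anti-aliasing noise.
function pixelDiff(
  current: Uint8Array,
  baseline: Uint8Array,
  channelTolerance = 2
): number {
  // Different sizes (or a buffer that is not whole RGBA pixels) cannot be
  // compared pixel by pixel, so treat the pair as fully different.
  if (current.length !== baseline.length || current.length % 4 !== 0) {
    return 1;
  }
  const pixels = current.length / 4;
  if (pixels === 0) return 0;

  let changed = 0;
  for (let i = 0; i < current.length; i += 4) {
    for (let c = 0; c < 4; c++) {
      if (Math.abs(current[i + c] - baseline[i + c]) > channelTolerance) {
        changed++; // count each pixel once, whichever channel moved
        break;
      }
    }
  }
  return changed / pixels;
}
```

With a `< 0.001` threshold, a 1000-pixel image passes only when no pixel moved beyond tolerance, because a single changed pixel gives exactly 0.001. Whether that is too strict depends on how noisy your screenshots are.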

Why this matters at scale

The headline number from my own production: 94 QA tickets processed in 2.5 days on a single project. Classic manual approach for the same scope: a full week of team work, easily two. With this stack: one engineer, half the week, full audit trail.

The other things that hold up at scale:

  • Deterministic exits keep the cost curve flat. You don’t pay tokens for clear-cut diffs.
  • Audit trail intact - every verdict has a reason code, every LLM call has its input bundle attached. You can replay any decision tomorrow.
  • Failure modes are bounded. A failing check still fails - but it fails the same way every time, with the same logs, and you fix it once.
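The replay claim is easier to trust with a shape attached to it. Below is a minimal sketch of an audit record that stores the exact input bundle next to the verdict and reason code. All names are hypothetical, and the hash is an FNV-1a-style function over UTF-16 code units, a non-cryptographic stand-in chosen only to keep the example free of imports. Swap in a real digest if tamper evidence matters.

```typescript
// FNV-1a-style 32-bit hash; non-cryptographic, used here only as a fingerprint.
function fnv1a(text: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h.toString(16).padStart(8, "0");
}

type Decision = { verdict: string; reason: string };

interface AuditRecord<I> {
  ticketKey: string;
  verdict: string;
  reason: string;    // reason code, e.g. "visual-identical"
  input: I;          // the exact bundle the decision was made on
  inputHash: string; // fingerprint taken at decision time
}

function recordDecision<I>(
  ticketKey: string,
  input: I,
  decide: (input: I) => Decision
): AuditRecord<I> {
  const { verdict, reason } = decide(input);
  return { ticketKey, verdict, reason, input, inputHash: fnv1a(JSON.stringify(input)) };
}

type ReplayResult = "match" | "input-changed" | "verdict-drift";

// Re-run the same decision function on the stored bundle and compare.
function replay<I>(rec: AuditRecord<I>, decide: (input: I) => Decision): ReplayResult {
  if (fnv1a(JSON.stringify(rec.input)) !== rec.inputHash) return "input-changed";
  const again = decide(rec.input);
  return again.verdict === rec.verdict && again.reason === rec.reason
    ? "match"
    : "verdict-drift";
}
```

For deterministic exits, a replay should always come back as `"match"` unless the check itself changed. For LLM verdicts, the stored bundle is what lets you re-ask the same question, though the answer is not guaranteed to be identical.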

The deeper reason isn’t cost. It’s blast radius. When AI runs at the I/O layer, every Playwright flake compounds with every LLM hallucination. Failures stack. You can’t debug them, you can’t reproduce them, you can’t trust them.

When AI runs at the judgment layer only, with a deterministic floor underneath and a deterministic publisher above, failures are bounded. The line I keep coming back to: if your AI agent is doing more than judging, you have a deterministic floor problem.

What’s next

This is just the spine. The actual production system has ten specific architecture layers, each with its own pattern, its own gotchas, its own reason to be deterministic before the LLM enters.

Tomorrow I publish the full map. Ten layers, A through J. What goes in each. Which four I’m releasing as code drops over the next 8 weeks. Which six stay in pitch-mode for now and why.

If you’ve ever wondered why your “AI QA” demo worked but production crashed - tomorrow’s piece names every gap I had to fix.

It’s not magic. It’s a sandwich.