Performance audit, Part 2: A deterministic floor you can trust

The hard part of a performance tool is not the AI. It is getting numbers you can trust without a human pasting them in.

Every performance audit I have ever shipped started the same way: open the page, run Lighthouse, copy the numbers into a document by hand. The measurement was real, but the workflow was manual, and a manual workflow doesn’t run in CI and doesn’t run twice the same way. The whole point of building a tool was to make the measurement automatic without making it flaky.

So before I trusted a single finding, I had to prove one thing: that the floor is stable.

What the floor is

Five sources, all deterministic, none of them a model.

Lighthouse runs five times, each in a fresh headless Chrome, and the tool takes the median. A fresh Chrome per run means no warm-cache bias leaks between runs. Five runs, because a single Lighthouse shot has real run-to-run variance, and a single shot is how you end up with a grade that changes every time you look at it.
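A minimal sketch of why the median holds steady where a single shot or an average would not. The `median` helper and the sample values are illustrative assumptions, not the tool's code or real measurements.

```typescript
// Hypothetical helper: median of a set of run results.
function median(values: number[]): number {
  if (values.length === 0) {
    throw new Error("median of an empty run set is undefined");
  }
  // Copy before sorting so the caller's array is left untouched.
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Made-up LCP samples in ms, with one slow outlier (2400).
const lcpRuns = [1200, 950, 1010, 2400, 980];
const lcpMedian = median(lcpRuns); // 1010
const lcpMean = lcpRuns.reduce((sum, v) => sum + v, 0) / lcpRuns.length; // 1308
```

One slow run drags the mean up by roughly 300 ms here, while the median ignores it.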

web-vitals runs through Playwright with a scripted interaction, producing LCP, CLS, TTFB and INP from the live page. I label these synthetic-field, not field, because they come from a scripted browser and not a real user. The distinction matters and I keep it visible.

The resource trace opens the page in Playwright, scrolls, lets it settle, and reads the Resource Timing API plus a buffered long-task observer - per-request sizes and timings, render-blocking status straight from Chromium, third-party origins, and main-thread blocking attribution.
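As a rough illustration of what reading the trace can look like once the entries are out of the browser, here is a sketch that summarises a list of resource entries. The `ResourceEntry` shape mirrors a few real `PerformanceResourceTiming` fields (`name`, `transferSize`, `duration`, and Chromium's `renderBlockingStatus`); `summariseTrace` and `TraceSummary` are hypothetical names, and comparing whole origins is a simplification that counts sibling subdomains as third-party.

```typescript
// A few fields copied from PerformanceResourceTiming entries.
interface ResourceEntry {
  name: string;          // request URL
  transferSize: number;  // bytes over the wire
  duration: number;      // ms
  renderBlockingStatus?: "blocking" | "non-blocking"; // Chromium-only field
}

interface TraceSummary {
  totalBytes: number;
  renderBlocking: string[];
  thirdPartyOrigins: string[];
}

// Hypothetical summariser: totals, render-blocking URLs, third-party origins.
function summariseTrace(pageUrl: string, entries: ResourceEntry[]): TraceSummary {
  const firstParty = new URL(pageUrl).origin;
  const thirdParty = new Set<string>();
  const renderBlocking: string[] = [];
  let totalBytes = 0;

  for (const entry of entries) {
    totalBytes += entry.transferSize;
    const origin = new URL(entry.name).origin;
    if (origin !== firstParty) thirdParty.add(origin);
    if (entry.renderBlockingStatus === "blocking") renderBlocking.push(entry.name);
  }

  return {
    totalBytes,
    renderBlocking,
    thirdPartyOrigins: Array.from(thirdParty).sort(),
  };
}
```

Nothing in this step interprets anything: it only counts and groups what the browser already reported.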

The bundle analyzer parses build stats from Vite, Rollup or nuxi analyze and scans dependencies without installing anything - chunk sizes, duplicate versions, declared-but-unused packages.
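The duplicate-version check is the easiest of these to picture. A sketch, assuming the resolved packages have already been read from a lockfile or build stats into a flat list; `ResolvedPackage` and `findDuplicateVersions` are hypothetical names, not the analyzer's actual API.

```typescript
interface ResolvedPackage {
  name: string;
  version: string;
}

// Hypothetical check: which package names resolve to more than one version?
function findDuplicateVersions(packages: ResolvedPackage[]): Record<string, string[]> {
  const versionsByName = new Map<string, Set<string>>();
  for (const { name, version } of packages) {
    const versions = versionsByName.get(name) ?? new Set<string>();
    versions.add(version);
    versionsByName.set(name, versions);
  }

  const duplicates: Record<string, string[]> = {};
  versionsByName.forEach((versions, name) => {
    // Plain string sort, good enough for display; not a semver comparison.
    if (versions.size > 1) duplicates[name] = Array.from(versions).sort();
  });
  return duplicates;
}
```

Given `lodash` at both `4.17.20` and `4.17.21` plus a single `vue`, this returns only the `lodash` entry with its two versions.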

And a static source pass catches the handful of anti-patterns you can find without a browser at all: an image with no dimensions, a deep watcher, runtimeCompiler: true. Definite things. No interpretation needed.
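To show how little machinery a "definite thing" needs, here is a sketch of the image-without-dimensions check. It is regex-based for brevity and the names (`findImagesWithoutDimensions`, `StaticFinding`, the `img-missing-dimensions` check id) are assumptions; a real pass would more likely walk a template AST, since this pattern trips on a `>` inside an attribute value.

```typescript
interface StaticFinding {
  check: string;
  line: number;     // 1-based line of the opening <img
  snippet: string;
}

// Hypothetical check: flag <img> tags missing a width or height attribute.
function findImagesWithoutDimensions(source: string): StaticFinding[] {
  const findings: StaticFinding[] = [];
  const imgTag = /<img\b[^>]*>/gi;
  let match: RegExpExecArray | null;

  while ((match = imgTag.exec(source)) !== null) {
    const tag = match[0];
    // Accept plain (width=), shorthand-bound (:width=) and v-bind:width= forms.
    const hasWidth = /(?:\s|:)width\s*=/i.test(tag);
    const hasHeight = /(?:\s|:)height\s*=/i.test(tag);
    if (!hasWidth || !hasHeight) {
      const line = source.slice(0, match.index).split("\n").length;
      findings.push({ check: "img-missing-dimensions", line, snippet: tag });
    }
  }
  return findings;
}
```

The output is a check id plus a file position, which is exactly the kind of fact that needs no model to establish.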

That is the floor. It is the boring part. It is also the part everything else stands on, which is why it had to be proven first.

The variance proof

I ran the full audit on portfolio.sdet.it twice. Median-of-five each time. Back to back. Here is what came out.

                      Audit #1      Audit #2      Δ
  Lighthouse score    72            72            0 pts
  CWV verdict         PASS          PASS          same
  Area grades         all A         all A         same
  LCP median          991 ms        926 ms        65 ms
  Overall grade       C             C             same band

The same two runs, laid out the way the tool prints them:

Run-to-run variance - https://portfolio.sdet.it
(Lighthouse median-of-5 + headless Chromium CWV, two back-to-back runs)

                        Run #1        Run #2
  Lighthouse perf       72/100        72/100
  LCP (lab)             991 ms        926 ms
  CLS (lab)             0.000         0.000
  TTFB                  42 ms         34 ms
  Core Web Vitals       PASS          PASS

  bundle                A (100)       A (100)
  runtime               A (100)       A (100)
  network               A (100)       A (100)
  ssr-hydration         A (100)       A (100)
  assets                A (100)       A (100)

The verdict is rock stable where it counts: identical Lighthouse score, identical PASS, identical area grades, identical letter grade. The raw LCP drifted 65 ms between runs - and that is the honest, interesting part. The underlying metric moves a little run to run, as it always does on a live network. Both values sit comfortably inside Google’s “good” band, so the verdict never wobbles.

That 65 ms is exactly why the tool runs five Lighthouse passes and takes the median instead of trusting a single shot. A single shot catches whichever number the network handed you that second. The median is what lets the grade hold steady while the raw metric breathes - and it is the difference between a number you can put in front of a client and a number you have to apologise for.

Ten Lighthouse runs across the two audits. Zero timeouts. Zero crashes. Deterministic medians every time.

The thesis of this entire tool - context before LLM - rests on this table. If the measured verdict wobbled, then everything the AI says on top of it would be built on sand. It doesn’t. So the AI gets to speak.

The five specialists

On top of the floor sit five AI specialists, one per area: bundle, runtime, network, SSR/hydration, assets. They run in parallel - a single dispatch, five agents, each handed its slice of the measured floor plus the source it needs to read.

Each one has a focused job and a Vue/Nuxt idiom behind it, because the method came from auditing Vue and Nuxt ecommerce - three production platforms - not from a generic checklist. The anti-patterns the specialists hunt are the ones that actually bit real carts and product listings, not textbook examples.

Bundle reads chunk sizes and dependency graphs - code-splitting gaps, duplicate dependency versions, packages shipped but never used.

Runtime looks at main-thread cost and the Vue-specific traps: deep watchers, inline style object literals inside a v-for that destabilise props on every render.

Network reads the resource trace for the waterfall - sequential awaits that should have been parallel, payloads fetched without field filtering, third-party origins.

SSR/hydration looks at TTFB, hydration strategy, route rules, island boundaries.

Assets handles images, fonts, the LCP element, icon-set bloat.

The crucial thing: the specialists read the measured output. They do not generate metrics. The number came from Lighthouse and the trace. The specialist explains why the number is what it is and what to do about it. Measurement is fact. Interpretation is the model’s job. The two never blur.

The overlap problem, and the protocol

Five specialists looking at one page will step on each other. A slow hydration pass shows up as a runtime cost, a network delay and an SSR issue all at once. Left unchecked, that gives you the same root problem reported three times with three different owners, and a findings count that lies by inflation.

SSR/hydration is the worst offender - it shares a seam with all four others. This is the same double-report risk I hit building the WCAG toolkit, where a modal focus trap and a keyboard handler kept claiming the same finding.

The protocol is simple and deterministic: the collector that surfaced a finding decides its owner. If the trace surfaced it, it belongs to whoever owns that signal, not to whoever else could plausibly claim it. On top of that, a deterministic dedup key collapses genuine duplicates - same check, same file and line, or same metric and route. Two specialists can both notice a problem; only one finding survives, attributed once.
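A sketch of the dedup half of that protocol, under one reading of the key: check id plus location, where the location is either file and line or metric and route. Every name here (`Finding`, `dedupKey`, `dedupe`, the `owner` field set by the collector) is a hypothetical stand-in, not the tool's real schema.

```typescript
type Area = "bundle" | "runtime" | "network" | "ssr-hydration" | "assets";

type FindingLocation =
  | { kind: "source"; file: string; line: number }
  | { kind: "metric"; metric: string; route: string };

interface Finding {
  check: string;
  owner: Area;       // assumed to be assigned from the collector that surfaced the signal
  reportedBy: Area;  // the specialist that noticed it
  location: FindingLocation;
}

// Deterministic key: the same inputs always produce the same string.
function dedupKey(finding: Finding): string {
  const loc = finding.location;
  return loc.kind === "source"
    ? `${finding.check}|${loc.file}:${loc.line}`
    : `${finding.check}|${loc.metric}@${loc.route}`;
}

// Keep one finding per key, attributed to the owner rather than the reporter.
function dedupe(findings: Finding[]): Finding[] {
  const byKey = new Map<string, Finding>();
  for (const finding of findings) {
    const key = dedupKey(finding);
    if (!byKey.has(key)) {
      byKey.set(key, { ...finding, reportedBy: finding.owner });
    }
  }
  return Array.from(byKey.values());
}
```

Three specialists reporting the same check on the same metric and route collapse to one finding, and the surviving entry carries the owner's name no matter who reported it first.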

It is not glamorous. It is the difference between a report a developer trusts and a report they argue with.

Why the headline is a split, not a grade

Part 1 showed my portfolio scoring C with every Core Web Vital green and every area graded A. That contradiction is not a bug to paper over. It is the reason the headline is built the way it is.

Performance has no single axis. Core Web Vitals are user and business truth. The Lighthouse score is a lab diagnostic. The area grades are where the work is. Those are three different questions, and a single letter answers none of them honestly - it averages them into a number that is wrong in a specific, misleading way.

So the headline reports two axes:

Core Web Vitals:  PASS / FAIL   (per-metric good / needs-improvement / poor)
Lighthouse perf:  NN/100        (lab, throttled)

The A-to-F grade did not get thrown away. It moved down a level, to per-area grades, where a single axis actually applies - because “how is the bundle doing” is a question with one answer.

And there is a third state that most tools skip: unmeasured. Run the tool on source only, with no URL to hit, and it cannot honestly say PASS or FAIL on vitals it never measured. So it says unmeasured, and hands back a provisional findings-grade with a banner that says exactly that. A tool that prints a confident grade for a measurement it never took is a tool lying to you politely. This one refuses to.
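The three-state verdict can be sketched in a few lines. The thresholds are Google's published good / needs-improvement boundaries for the three Core Web Vitals; the rule that every measured vital must rate good for a PASS is an assumption about how the tool decides, and `rate` and `cwvVerdict` are hypothetical names.

```typescript
type Rating = "good" | "needs-improvement" | "poor";
type Verdict = "PASS" | "FAIL" | "unmeasured";
type CoreVital = "LCP" | "CLS" | "INP";

// [upper bound of "good", upper bound of "needs-improvement"]
const THRESHOLDS: Record<CoreVital, [number, number]> = {
  LCP: [2500, 4000], // ms
  CLS: [0.1, 0.25],  // unitless layout-shift score
  INP: [200, 500],   // ms
};

function rate(metric: CoreVital, value: number): Rating {
  const [good, needsImprovement] = THRESHOLDS[metric];
  if (value <= good) return "good";
  return value <= needsImprovement ? "needs-improvement" : "poor";
}

// Assumed rule: PASS only if every measured vital is good;
// with nothing measured, refuse to say PASS or FAIL.
function cwvVerdict(measured: Partial<Record<CoreVital, number>>): Verdict {
  const ratings = (Object.keys(THRESHOLDS) as CoreVital[])
    .filter((metric) => measured[metric] !== undefined)
    .map((metric) => rate(metric, measured[metric] as number));
  if (ratings.length === 0) return "unmeasured";
  return ratings.every((rating) => rating === "good") ? "PASS" : "FAIL";
}
```

With the lab values from the variance runs, `cwvVerdict({ LCP: 991, CLS: 0 })` comes out as PASS, and `cwvVerdict({})` comes out as unmeasured rather than a guess.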

The honest CI reality

A median-of-five audit with throttling and web-vitals takes about two to three minutes per URL. That is fine for a CI sample on a representative page. It is also exactly why v0.1 audits one route and not the whole site - N routes is N times that cost, and multi-route is a deliberate later-version problem rather than a thing I pretended to solve now.

One real gotcha worth naming for anyone who clones this: Lighthouse finds Chrome through chrome-launcher, which on my machine discovered the system Chrome. A clean Linux CI runner has no system Chrome. The fix is to point Lighthouse at the Chromium that Playwright already installed, which is roughly 150 MB and already there. I would rather tell you that up front than let you hit it on your first green-to-red CI run.
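One way to wire that fix, sketched under assumptions: chrome-launcher's `launch()` accepts `chromePath` and `chromeFlags`, and Playwright exposes the installed binary through `chromium.executablePath()`. The helper name `lighthouseChromeOptions` is hypothetical, and the wiring is left as comments because it needs both packages installed.

```typescript
interface ChromeLaunchOptions {
  chromePath: string;
  chromeFlags: string[];
}

// Hypothetical helper: build chrome-launcher options from a known Chromium path,
// so Lighthouse never has to go looking for a system Chrome.
function lighthouseChromeOptions(chromiumPath: string): ChromeLaunchOptions {
  if (chromiumPath.trim() === "") {
    throw new Error("No Chromium path given; was `npx playwright install chromium` run?");
  }
  return {
    chromePath: chromiumPath,
    chromeFlags: ["--headless"],
  };
}

// Wiring, not executed here (requires `playwright` and `chrome-launcher`):
//   import { chromium } from "playwright";
//   import * as chromeLauncher from "chrome-launcher";
//   const chrome = await chromeLauncher.launch(
//     lighthouseChromeOptions(chromium.executablePath()),
//   );
```

Setting the `CHROME_PATH` environment variable to the same path should have a similar effect, since chrome-launcher checks it when locating a browser.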

Tomorrow

Part 3: where this method actually came from. Not a toy site - three ecommerce platforms, the kind with carts and payment steps and a hundred routes. I will show the tool on a deliberately broken demo so you can see it find things instead of finding nothing, walk through the honest line between the public distillate and the production engagement, and explain why /perf:fix will guide you through a fix but won’t pretend to auto-apply the architectural ones.

#FromTheField
