I spent the last stretch auditing performance on three ecommerce platforms - the kind with carts, variant selectors and payment steps, where a slow page is lost revenue. Then I distilled the method into a tool, pointed it at my own portfolio first, and it handed me a C.
Then I looked at the Core Web Vitals it measured. All green. LCP 991 ms. CLS zero. TTFB 42 ms. So which is it - slow, or fast?
That contradiction is the whole reason this tool shows two numbers where most tools show one. This is the field report on why.
The setup
The target is portfolio.sdet.it. It is about as thin as a website gets: static Astro, 3.3 KB of JavaScript, 21.7 KB total, six requests. No framework runtime shipped to the browser. No hydration. Nothing clever.
The tool is sdet-perf-toolkit v0.1 - a frontend performance audit that measures in a real browser first and lets AI interpret the measurement second. It runs Lighthouse five times in headless Chrome and takes the median, injects web-vitals through Playwright, reads the resource trace, and parses the bundle. The measured floor comes first. The AI never invents a number.
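A minimal sketch of the median-of-five step, for readers who want to see why the median is used rather than the mean. The function name and the sample scores are made up for illustration; this is not the toolkit's own code.

```typescript
// Hypothetical helper: name and signature are illustrative, not the toolkit's API.
function median(values: readonly number[]): number {
  if (values.length === 0) {
    throw new Error("median() needs at least one value");
  }
  // Copy before sorting so the caller's array is left untouched.
  const sorted = values.slice().sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  // Odd count: the middle element. Even count: mean of the two middle elements.
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Made-up Lighthouse scores from five runs, one of them an outlier.
const runScores = [72, 71, 74, 72, 55];
console.log(median(runScores)); // 72
```

With a mean, the one outlier run would pull the result down to 68.8. The median ignores it, which is the point of running five times.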
I ran it on my own site. Dogfooding before anyone else sees the tool. Here is the headline it produced:
Core Web Vitals: PASS (LCP good · CLS good · TTFB good · INP unmeasured)
Lighthouse perf: 72/100 (lab, throttled)
The measured values behind that PASS:
| Metric | Value | Rating | Role |
|---|---|---|---|
| LCP | 991 ms | good | core |
| CLS | 0.000 | good | core |
| TTFB | 42 ms | good | supporting |
Pass on the vitals. 72 on Lighthouse. A grade of C if you collapse it to a single letter.
And then the part that matters. The tool grades five areas - bundle, runtime, network, SSR/hydration, assets. Every one of them came back A. Zero findings. Each.
Five A’s and a C. Same page. Same audit. Same second.
So which number is lying?
Neither, exactly. They measure different things, and that is the point most people miss.
Core Web Vitals track what a real user feels, and Google uses them as a ranking signal. Largest Contentful Paint at 991 ms means the main content paints fast. Cumulative Layout Shift at zero means nothing jumps around as it loads. Those are good, by Google’s own thresholds, with room to spare.
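To put a number on "room to spare", here is a small sketch that rates a value against Google's published thresholds (LCP 2500/4000 ms, CLS 0.1/0.25, TTFB 800/1800 ms). The thresholds are the public ones; the function and type names are illustrative and not taken from the toolkit.

```typescript
type Rating = "good" | "needs-improvement" | "poor";

// Published thresholds: [upper bound of "good", upper bound of "needs improvement"].
const THRESHOLDS = {
  LCP: [2500, 4000], // milliseconds
  CLS: [0.1, 0.25],  // unitless layout-shift score
  TTFB: [800, 1800], // milliseconds
} as const;

type Metric = keyof typeof THRESHOLDS;

// Hypothetical helper: rates one measured value against its thresholds.
function rate(metric: Metric, value: number): Rating {
  const [goodMax, needsImprovementMax] = THRESHOLDS[metric];
  if (value <= goodMax) return "good";
  if (value <= needsImprovementMax) return "needs-improvement";
  return "poor";
}

// The three values measured on the portfolio.
console.log(rate("LCP", 991));  // good
console.log(rate("CLS", 0));    // good
console.log(rate("TTFB", 42));  // good
```

An LCP of 991 ms sits at roughly 40% of the "good" ceiling, so the page could get noticeably slower and still pass.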
The Lighthouse performance score is a lab diagnostic. Total Blocking Time carries the heaviest weight in it (30% under Lighthouse 10 scoring), and on this page TBT came back around 1740 ms. That single metric drags the composite score down to 72 even though every vital passes.
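The arithmetic behind that drag can be sketched as a weighted sum. The weights below are the Lighthouse 10 ones (FCP 10, Speed Index 10, LCP 25, TBT 30, CLS 25); the post does not say which Lighthouse version the toolkit runs, so treat the version as an assumption. The per-metric scores in the example are hypothetical, not the ones from this audit.

```typescript
// Lighthouse 10 metric weights in percent (assumption: the version is not stated in the post).
const WEIGHTS = { FCP: 10, SI: 10, LCP: 25, TBT: 30, CLS: 25 } as const;

type WeightedMetric = keyof typeof WEIGHTS;

// Each per-metric score is a number between 0 and 1.
type MetricScores = Record<WeightedMetric, number>;

// Hypothetical helper: combines per-metric scores into a 0..100 composite.
function compositeScore(scores: MetricScores): number {
  let total = 0;
  for (const key of Object.keys(WEIGHTS) as WeightedMetric[]) {
    total += WEIGHTS[key] * scores[key];
  }
  return Math.round(total);
}

// Hypothetical case: every metric perfect except TBT, which scores zero.
const floorWhenOnlyTbtFails = compositeScore({ FCP: 1, SI: 1, LCP: 1, TBT: 0, CLS: 1 });
console.log(floorWhenOnlyTbtFails); // 70
```

A page that is perfect everywhere except TBT cannot score below 70 or above 100, and a poor TBT lands it near the bottom of that range. That is the neighbourhood a 72 lives in.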
So the honest reading is: the site is fast for a user, and a lab metric is unhappy about something. A single C-grade would launder that nuance into a lie. A single 100 would launder it the other way. Two axes tell the truth.
But I wanted to know what the lab was unhappy about. A number you can’t explain is a number you can’t trust.
The dig
The trace collector exists for exactly this. It opens the page in Playwright, scrolls, waits for the page to settle, and reads the Resource Timing API plus a buffered long-task observer. Real timings, real render-blocking status, real main-thread attribution.
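Once long-task entries are collected, the attribution question comes down to a small aggregation. The sketch below works on a simplified, hypothetical record shape (the field names and the label set are assumptions, not the toolkit's format) and uses made-up sample data.

```typescript
// Hypothetical, simplified record for one long task as a collector might store it.
interface LongTaskSample {
  startTimeMs: number;
  durationMs: number;
  attribution: string; // a label such as "unknown"; the label set is an assumption
}

// Sum long-task time per attribution label.
function totalByAttribution(samples: readonly LongTaskSample[]): Record<string, number> {
  const totals: Record<string, number> = {};
  for (const s of samples) {
    totals[s.attribution] = (totals[s.attribution] ?? 0) + s.durationMs;
  }
  return totals;
}

// Find the single longest task, or undefined when there are none.
function longestTask(samples: readonly LongTaskSample[]): LongTaskSample | undefined {
  let longest: LongTaskSample | undefined;
  for (const s of samples) {
    if (longest === undefined || s.durationMs > longest.durationMs) {
      longest = s;
    }
  }
  return longest;
}

// Made-up data: one large task with no attribution, one small scripted one.
const samples: LongTaskSample[] = [
  { startTimeMs: 120, durationMs: 300, attribution: "unknown" },
  { startTimeMs: 900, durationMs: 60, attribution: "script" },
];
console.log(totalByAttribution(samples)); // { unknown: 300, script: 60 }
console.log(longestTask(samples)?.attribution); // unknown
```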
I ran it three times against the live site. Stable to within five milliseconds.
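"Stable to within five milliseconds" is a claim about spread: the gap between the slowest and fastest run. A minimal sketch of that check follows, with hypothetical function names and made-up timings.

```typescript
// Hypothetical helper: difference between the largest and smallest sample.
function spread(samples: readonly number[]): number {
  if (samples.length === 0) {
    throw new Error("spread() needs at least one sample");
  }
  return Math.max(...samples) - Math.min(...samples);
}

// Hypothetical helper: true when all samples fall within the given tolerance.
function isStable(samples: readonly number[], toleranceMs: number): boolean {
  return spread(samples) <= toleranceMs;
}

// Made-up timings from three runs, in milliseconds.
const threeRuns = [808, 811, 812];
console.log(spread(threeRuns));      // 4
console.log(isStable(threeRuns, 5)); // true
```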
The 1740 ms of blocking time that Lighthouse reported traces back to a single long task, which the trace collector measured at about 810 ms. Its attribution: unknown. That is the browser doing its own work - parsing, style, layout, paint. Not my JavaScript. Not a third-party script. There is barely any JavaScript to blame; the whole page ships 3.3 KB of it.
What actually happened is this. Lighthouse runs with 4x CPU throttling to simulate a mid-tier device. Under that throttle, the browser’s own parse-and-layout pass on the document stretches out far enough to register as a long task. TBT counts it. The score drops. A real user on a real machine never experiences this as a long task at all - which is exactly why the measured vitals come back green.
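The mechanism can be shown with a toy model. A task only counts as long past 50 ms, and only the part beyond 50 ms counts as blocking. The sketch below assumes CPU throttling stretches every task linearly, which is a simplification, and it ignores the FCP-to-TTI window that Lighthouse applies. The task durations are made up.

```typescript
const LONG_TASK_THRESHOLD_MS = 50;

// Blocking time of one task: only the part beyond 50 ms counts.
function blockingTime(durationMs: number): number {
  return Math.max(0, durationMs - LONG_TASK_THRESHOLD_MS);
}

// Simplified TBT: sum of blocking time over all main-thread tasks.
// cpuSlowdown models throttling as a linear stretch (an assumption, not how a real CPU behaves).
function totalBlockingTime(taskDurationsMs: readonly number[], cpuSlowdown: number = 1): number {
  return taskDurationsMs.reduce(
    (sum, duration) => sum + blockingTime(duration * cpuSlowdown),
    0,
  );
}

// Made-up main-thread tasks, all under the 50 ms threshold on a fast machine.
const tasks = [45, 30, 20];
console.log(totalBlockingTime(tasks));    // 0
console.log(totalBlockingTime(tasks, 4)); // 230
```

The same three tasks contribute nothing unthrottled and 230 ms at 4x. Nothing about the page changed between the two calls, only the simulated device.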
There is an honest limit here, and I will name it. The Long Tasks API attributes self-work as unknown and won’t split it finer than that. Pinning it down to parse-versus-layout-versus-paint needs a full CDP performance trace, which is a heavier capture than v0.1 does. So I can tell you the long task is browser self-work under throttling. I can’t yet tell you which microsecond went where. That is a roadmap item, and I would rather say so than pretend the tool knows more than it does.
The lesson
Here is what this whole exercise is really about.
If my tool had only shown you “Grade: C”, you would have walked away thinking my portfolio is slow. It isn’t. Every vital that maps to user experience is green, and every area the tool can audit came back clean.
If it had only shown you “Lighthouse: 100” - which it doesn’t, but plenty of tools chase that number - you would have learned nothing about the TBT behaviour under throttling, which on a heavier site genuinely matters.
A performance score is not a performance verdict. The score is a useful lab signal. The vitals are the user truth. The area grades are where the work is, or in this case isn’t. You need all of them, and you need them kept apart, because the moment you average them into one letter you have thrown away the only information worth having.
That is why the headline is a split, not a grade. It is the one design decision in this tool I care about most, and my own portfolio is the proof. A site that scores C and deserves five A’s is the entire argument in a single screenshot.
| Area | Grade | Score | Findings |
|---|---|---|---|
| bundle | A | 100 | 0 |
| runtime | A | 100 | 0 |
| network | A | 100 | 0 |
| ssr-hydration | A | 100 | 0 |
| assets | A | 100 | 0 |
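One way to keep the axes apart in code is to never compute a combined value at all. The sketch below builds the two headline lines from separate fields; the type names, the function names, and the rule that an unmeasured metric does not fail the verdict are assumptions for illustration, not the toolkit's implementation.

```typescript
type VitalStatus = "good" | "needs-improvement" | "poor" | "unmeasured";

// Hypothetical report shape: the two axes live in separate fields and are never merged.
interface SplitReport {
  vitals: Record<string, VitalStatus>;
  lighthousePerf: number; // 0..100, lab score
}

// Assumption: PASS requires every measured vital to be "good"; unmeasured ones are reported but ignored.
function vitalsVerdict(vitals: Record<string, VitalStatus>): "PASS" | "FAIL" {
  const allAcceptable = Object.keys(vitals).every((name) => {
    const status = vitals[name];
    return status === "good" || status === "unmeasured";
  });
  return allAcceptable ? "PASS" : "FAIL";
}

// Produces one line per axis instead of a single collapsed grade.
function splitHeadline(report: SplitReport): string[] {
  const detail = Object.keys(report.vitals)
    .map((name) => `${name} ${report.vitals[name]}`)
    .join(" · ");
  return [
    `Core Web Vitals: ${vitalsVerdict(report.vitals)} (${detail})`,
    `Lighthouse perf: ${report.lighthousePerf}/100 (lab, throttled)`,
  ];
}

const headlineLines = splitHeadline({
  vitals: { LCP: "good", CLS: "good", TTFB: "good", INP: "unmeasured" },
  lighthousePerf: 72,
});
console.log(headlineLines.join("\n"));
```

Because the report type has no field for a single grade, there is no place for a collapsed letter to sneak in.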
What’s under the hood
The measurement I have been describing is the deterministic floor: Lighthouse median-of-five, synthetic web-vitals, the resource trace, the bundle parse, plus a static source pass for the handful of anti-patterns you can catch without a browser at all. None of that involves a model. It is the same boring, reproducible measurement a careful engineer would do by hand - except it runs in two minutes and produces the same answer twice.
On top of that floor sit five AI specialists, one per area, each reading the measured output and the source to explain root cause and rank what to fix. On my portfolio they had nothing to say, because there was nothing to fix. On a site that is actually mis-built, they have plenty - which is what Part 2 is about.
That ordering is the thesis, and it has a name I keep coming back to: context before LLM. Measure first. Let the model interpret what was measured. Never let it guess the numbers. It is the same discipline that held up across three production ecommerce audits, where a wrong number doesn’t cost you a grade - it costs the client a checkout.
Tomorrow
Part 2: the architecture in full, and the thing I most needed to prove before trusting any of this - that the measured floor is stable. I ran the same audit twice, median-of-five each time, back to back. Identical score, identical grade, identical pass - the kind of stability you can put in front of a client. I’ll show you the numbers, the five specialists, and why a split headline beats a single grade once you’ve seen what a single grade hides.
Then Part 3: where the method actually came from - three ecommerce platforms, not a toy - and where the distillate ends and the real engagement begins.
#FromTheField