Part 1 framed the problem. Part 2 shipped the engine and the benchmark. Part 3 is the post you should read before deciding to adopt this - because the honest answer to “should I run jarvis-brain on my codebase” is “it depends on your codebase, and here is how to tell.”

Plus the V0.5 tier - what the same engine looks like when the unit of work stops being a repo and starts being a design system org with multiple consumer fronts.

When it pays off

The benchmark numbers from yesterday were averaged across fifty questions on a 5-repo monorepo. That word - monorepo - is doing real work. Brain wins by the largest margin on architecture and cross-repo questions, where the graph carries information the native tools have to reconstruct from scratch every time.

Concretely, the signals that say “this engine will pay off on your codebase”:

  • Three or more repos with shared code. A core package consumed by multiple consumer apps. A design system used by several frontends. Microservices that import from a common toolkit. The federation layer is where brain earns its place.
  • Tens of thousands of files or more. At under fifty files, Glob is fast and the model holds the whole thing in working memory. At fifty thousand, the model pays exploration burn on every cross-cutting question.
  • Design system or shared primitives that get overridden per consumer. If your consumers are doing :root { --color-brand: ... } overrides on tokens defined in a shared library, you have a problem class that brain_ffcss was specifically built for.
  • Multiple engineers asking architectural questions of the same codebase. “Where do god-nodes live”, “what depends on this primitive”, “what gets touched if I change this contract”. In the benchmark, architecture-category questions like these took fifty-three percent less wall time with brain than without.
  • Long-lived projects where exploration burn compounds. If your AI sessions consistently spend the first five minutes re-discovering the project layout, that is the thing being eliminated.

Brain does not eliminate the LLM’s job. It eliminates the part where the LLM has to re-derive structural facts from raw source on every question.

When Grep is already enough

The honest counterpart. The signals that say “you do not need this yet”:

  • Single repo under fifty files. Glob and Grep are already optimal here. The benchmark caught this directly - code discovery questions ran twenty percent slower with brain than without.
  • Solo developer working on a project they wrote. You have the structure in your head. The LLM is helping with execution, not navigation. The marginal value of pre-computed graphs is small.
  • Throwaway scripts, prototypes, exploratory notebooks. Anything you might delete in two weeks. The cost of building and maintaining the graph exceeds the cost of the exploration burn.
  • Codebases that change shape every week. Graph staleness becomes a maintenance burden of its own. Stick with tools that read source on every query.
  • No cross-repo concern. If your work is bounded inside one repo and you do not need to trace anything across boundaries, the federation layer adds complexity without payoff.

The trap to avoid: adopting tooling because it is impressive in a benchmark, then paying maintenance cost on something you did not need. Engineering judgment here looks like “where does my time actually go when I work with the AI” - if the answer is mostly “writing code” and not “explaining the codebase to it”, you probably do not need graph-backed context. If the answer is the other way around, you probably do.
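The two lists above can be turned into a rough self-check. This is a minimal sketch, not part of jarvis-brain: the CodebaseProfile fields and the adoption_signals function are hypothetical names, and the 10,000-file cutoff is my own reading of “tens of thousands”. The other thresholds (three repos, fifty files) come straight from the lists.

```python
from dataclasses import dataclass


@dataclass
class CodebaseProfile:
    """Rough facts about a codebase. All field names are illustrative."""
    repo_count: int
    file_count: int
    shares_code_across_repos: bool
    has_per_consumer_token_overrides: bool
    reshaped_weekly: bool   # project structure changes every week
    throwaway: bool         # script, prototype, or exploratory notebook


def adoption_signals(p: CodebaseProfile) -> tuple[list[str], list[str]]:
    """Return (signals_for, signals_against) graph-backed context."""
    pro: list[str] = []
    con: list[str] = []

    if p.repo_count >= 3 and p.shares_code_across_repos:
        pro.append("three or more repos with shared code")
    if p.file_count >= 10_000:  # assumed cutoff for "tens of thousands"
        pro.append("tens of thousands of files")
    if p.has_per_consumer_token_overrides:
        pro.append("design tokens overridden per consumer")

    if p.repo_count == 1 and p.file_count < 50:
        con.append("single repo under fifty files")
    if p.throwaway:
        con.append("throwaway code")
    if p.reshaped_weekly:
        con.append("codebase changes shape every week")
    if not p.shares_code_across_repos:
        con.append("no cross-repo concern")

    return pro, con
```

The function returns both lists rather than a single yes or no on purpose. The decision is still a judgment call about where your time goes; the sketch only makes the inputs explicit.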

The daily use case - a multi-brand commerce setup

To make this concrete, here is the shape of the work where brain earns its keep daily.

The setup: one shared core package - components, composables, types, design tokens, the routing primitives. On top of it, five brand-variant frontends - same product family, different visual identities, different feature gating, different checkout flows per market. Each consumer overrides design tokens, adds brand-specific routes, plugs in market-specific payment integrations.

A typical question: “If I change the contract on useBaseCart in core, what breaks across the five fronts, and where would I need to update tests.”

Without brain: open core, find useBaseCart, read the signature. Grep for useBaseCart across all six repos. Open each hit. Read enough context to understand the call site. Decide if it breaks. Move to the next hit. Realistically: forty-five minutes of context-switching for a moderately complex change.

With brain: brain_explain on the node useBaseCart returns inbound edges - every consumer that depends on it, with the file and line of the call site, plus inferred contract usage. One tool call. The model has the full impact surface before it touches a single source file. The “what breaks” answer arrives in five minutes instead of forty-five. The remaining time is for actually writing the change and the test updates.
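To show why one lookup can replace the grep-and-open loop, here is a toy version of an inbound-edge query. It is a sketch under assumptions, not the brain_explain implementation: the Edge shape, the function names, and the repo and file names in the sample data are all invented for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Edge(NamedTuple):
    """One dependency edge: a call site in a consumer pointing at a symbol."""
    source_repo: str  # repo that contains the call site
    file: str
    line: int
    target: str       # symbol the call site depends on


def build_inbound_index(edges: list[Edge]) -> dict[str, list[Edge]]:
    """Index edges by target once, so each later lookup is a dict access."""
    index: dict[str, list[Edge]] = defaultdict(list)
    for edge in edges:
        index[edge.target].append(edge)
    return dict(index)


def impact_surface(index: dict[str, list[Edge]], symbol: str) -> dict[str, list[str]]:
    """Group the call sites of `symbol` by consuming repo, as file:line strings."""
    by_repo: dict[str, list[str]] = defaultdict(list)
    for edge in index.get(symbol, []):
        by_repo[edge.source_repo].append(f"{edge.file}:{edge.line}")
    return dict(by_repo)


# Hypothetical sample data: two fronts consuming useBaseCart from core.
SAMPLE_EDGES = [
    Edge("front-a", "src/checkout/Cart.vue", 12, "useBaseCart"),
    Edge("front-a", "src/header/MiniCart.vue", 8, "useBaseCart"),
    Edge("front-b", "src/checkout/Cart.vue", 15, "useBaseCart"),
    Edge("front-b", "src/account/Profile.vue", 4, "useBaseUser"),
]
```

The expensive part, building the index, happens once. Every “what depends on this” question after that is a lookup rather than a fresh scan of six repos.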

This is what the benchmark architecture-category numbers feel like in production. Fifty-three percent less wall time on the questions that previously felt like archaeology.

The V0.5 tier - federation at scale

The public repo is the engine. The private version of the engine includes the multi-tenant scaffolding. The V0.5 tier is the engine plus a different kind of capability on top - the one that matters when the unit of analysis is no longer “this repo” but “this organization’s design system and everything that consumes it.”

Three things the V0.5 tier adds:

Cross-repo deduplication. Run a WCAG audit, or a design token audit, or a contract audit across ten consumer repos at once. Findings get clustered: an aria-label problem that appears in nine of ten repos is one finding with nine manifestations, not nine findings. The cluster reports the canonical fix once and the affected consumers as a list. This changes review economics by an order of magnitude when you are managing a design system as a product.
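The clustering idea can be sketched in a few lines. This is an assumption-heavy illustration, not the V0.5 code: the Finding fields are hypothetical, and so is the clustering key (rule plus originating shared component). A real deduplicator would likely need a fuzzier fingerprint.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """One audit hit in one consumer repo. Field names are illustrative."""
    repo: str
    rule: str       # e.g. "aria-label-missing"
    component: str  # shared component the problem traces back to
    location: str   # file:line in the consumer


def cluster_findings(findings: list[Finding]) -> list[dict]:
    """Collapse findings that share (rule, component) into one cluster each."""
    groups: dict[tuple[str, str], list[Finding]] = defaultdict(list)
    for finding in findings:
        groups[(finding.rule, finding.component)].append(finding)

    clusters = []
    for (rule, component), group in groups.items():
        clusters.append({
            "rule": rule,
            "component": component,
            "repos": sorted({f.repo for f in group}),  # affected consumers
            "manifestations": len(group),
        })
    return clusters
```

A reviewer then reads one entry per cluster and sees the affected consumers as a list, instead of triaging the same problem once per repo.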

Design system token federation. Tokens defined in the core library, overridden in consumers, sometimes overridden again in feature flags. The federation layer tracks the canonical definition, every override edge, every consumer that bypasses the system entirely (DRY violation). brain_ffcss already shows you this for one group; V0.5 makes it queryable across an organization.
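To make “override edge” concrete, here is a minimal sketch that compares CSS custom properties in a core stylesheet against each consumer. It is not how brain_ffcss works internally; the function names are hypothetical, and the regex only handles plain `--name: value` declarations. It does not handle nested feature-flag overrides, and it does not try to detect consumers that bypass the tokens altogether.

```python
import re

# Matches custom property declarations such as "--color-brand: #0055ff".
# Usages like var(--color-brand) are skipped because no colon follows the name.
DECLARATION = re.compile(r"(--[\w-]+)\s*:\s*([^;}]+)")


def parse_tokens(css: str) -> dict[str, str]:
    """Extract custom property declarations from a CSS string."""
    return {name: value.strip() for name, value in DECLARATION.findall(css)}


def override_edges(core_css: str, consumers: dict[str, str]) -> list[tuple[str, str, str, str]]:
    """List (consumer, token, core_value, consumer_value) for each real override.

    A consumer that redeclares a token with the same value as core is not
    counted as an override.
    """
    core = parse_tokens(core_css)
    edges = []
    for consumer, css in sorted(consumers.items()):
        for name, value in parse_tokens(css).items():
            if name in core and value != core[name]:
                edges.append((consumer, name, core[name], value))
    return edges
```

Each tuple is one edge from a canonical definition to a consumer override. Collecting those edges across every consumer is what makes the override map queryable.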

Three audit modes. --scope component audits a single component across all consumers. --scope page audits a routing-level slice (the checkout page, across five fronts, comparing implementations). --scope full does the org-wide pass. The same engine, different traversal strategies, different result shapes. The point is that you ask the question in the shape that matches the work, not in the shape the tool happens to support.
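One way to picture “same engine, different traversal strategies” is a single selection function that changes what it walks based on the scope. This is a sketch only: the three scope values come from the flags above, but the index layout, the Scope enum, and select_targets are hypothetical and say nothing about the real CLI internals.

```python
from enum import Enum
from typing import Optional


class Scope(Enum):
    COMPONENT = "component"
    PAGE = "page"
    FULL = "full"


# Assumed index layout:
# {consumer: {"components": {name: [files]}, "pages": {name: [files]}}}
Index = dict[str, dict[str, dict[str, list[str]]]]


def select_targets(index: Index, scope: Scope, name: Optional[str] = None) -> dict[str, list[str]]:
    """Pick the files to audit per consumer for the given scope."""
    if scope is not Scope.FULL and name is None:
        raise ValueError(f"--scope {scope.value} needs a target name")

    targets: dict[str, list[str]] = {}
    for consumer, slices in index.items():
        if scope is Scope.COMPONENT:
            files = slices["components"].get(name, [])
        elif scope is Scope.PAGE:
            files = slices["pages"].get(name, [])
        else:
            # Full scope: every file from every slice of this consumer.
            files = [f for group in slices.values() for paths in group.values() for f in paths]
        if files:
            targets[consumer] = sorted(set(files))
    return targets
```

The result shape is the same for all three scopes (consumer to file list), so whatever runs the audit afterwards does not need to know which scope was asked for.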

This is the version that justifies a commercial engagement. Multi-tenant production deployment, cost optimization across hundreds of audit runs, integration with whatever ticketing and design tool the org already runs. The engine is open source and educational; the operational layer is not.

The three tiers

To make the offer surface crisp:

graph LR
    A[jarvis-brain-core<br/>Public AGPL-3.0] --> B[jarvis-brain<br/>Private full<br/>multi-tenant prod]
    B --> C[V0.5 Enterprise<br/>Cross-repo dedup<br/>Token federation<br/>3 audit modes]

  • Tier 1 - jarvis-brain-core (public, AGPL-3.0). The engine. Clone it, run it, learn from it. Build your own on top if you want.
  • Tier 2 - jarvis-brain (private, commercial). The production deployment. Multi-tenant, auth, webhook orchestration, cost tracking, alerting. Not open-source. Available as a commercial engagement.
  • Tier 3 - V0.5 Enterprise (commercial + design system focus). Tier 2 plus cross-repo federation, design system token tracking at organization scale, three audit modes. The version that makes sense for design system orgs with five or more consumer fronts.

If you run a setup that matches the “when it pays off” list - especially if you have a shared design system or core library consumed by multiple fronts - the productive conversation is at the Tier 2 and Tier 3 level. The public repo is enough to evaluate whether the method is real. It is not enough to run in production.

DM is open. The brief that helps me most: how many repos, what they share, where exploration burn currently eats your time. If you have those three answers I can tell you within a conversation whether brain is worth pursuing for your context, or whether your bottleneck is somewhere else entirely.

sdet.it/services for the longer version of the offer.

Series wrap

Three days. One problem class. One engine. One benchmark. One enterprise tier.

What I tried to do across this series: show the actual decision process, not the polished outcome. Part 1 had the moment I almost shipped naive RAG and deleted it. Part 2 had the benchmark category where brain loses to Grep by twenty percent. Part 3 has the cases where you should not adopt this at all. The honest version of any architecture story includes the parts that did not work.

If you spent forty-five minutes on this series, here is what I hope sticks: graph-backed context is not a magic upgrade to your AI workflow. It is a specific solution to a specific class of problem - structural exploration on codebases big enough or distributed enough that the LLM cannot hold them in working memory. If you have that problem, the engine pays off measurably. If you do not, Grep is still the right answer.

Next week

Series #03 lands next Tuesday: context-first QA. The premise: most AI-in-QA content says “let AI write your tests.” I do the opposite - the AI never writes the tests, but it does almost everything else around them. Why that distinction matters, what it looks like in practice, and what the failure modes are.

#FromTheField - new series Tuesday morning.