Evals — turning AI from vibes into an engineering practice

Last week Google released Gemini 3.5 Flash. They published a benchmark table comparing it to Claude Opus 4.7 and GPT-5.5. Here are the results on a couple of benchmarks:

Benchmark
Gemini 3.5 Flash
Claude Opus 4.7
GPT-5.5

SWE-Bench Pro (agentic coding)	55.1%	64.3%	58.6%
Terminal-bench 2.1 (agentic terminal coding)	76.2%	66.1%	78.2%
MCP Atlas (multi-step workflows)	83.6%	79.1%	75.3%
MRCR v2 128k (long context)	77.3%	59.3%	94.8%

So which model is “better”? Depends entirely on what you’re building. If your agents do a lot of multi-step workflows, Gemini 3.5 Flash wins. If you care about long context retrieval, GPT-5.5 is miles ahead. If your work looks like SWE-Bench tasks, Opus is still the one.

The number that matters is the one that reflects your tasks — and that number doesn’t exist until you measure it yourself. That’s what evals are for.

“But how do we know it actually works?”

This is the question I hear most in training rooms, from developers who’ve played with AI tools but haven’t committed to using them in production: I don’t know how well this actually works, and I can’t ship something I can’t trust.

No two devices from the same production line are identical. Apple famously specified that the iPad Pro could bend up to 400 microns and still be within manufacturing tolerance — they published this during a PR crisis, essentially saying “yes it bends, and that’s by design.” Precision is an engineering parameter you dial up or down depending on the tradeoff you’re willing to make. The same applies to AI reliability.

At Google, as an SRE, we had SLAs — and when we were performing too well against them, the guidance wasn’t “great, keep going.” It was “you have error budget, use it.” Ship faster. Deploy more aggressively. Reduce latency. The error budget wasn’t a ceiling to avoid — it was a resource to spend deliberately.

The same mental model applies to AI. You’re not trying to get to 100%. You’re trying to know your number, understand what it costs you, and decide how to spend the gap.

LLMs are non-deterministic — but so are humans. Judges give harsher sentences before lunch than after. The question was never “does it always get it right.” It was always “how often does it get it right enough, and do I know when it doesn’t.”

Most teams can’t answer that. Not because the answer is bad, but because they’ve never measured. They shipped the feature, it seemed to work in testing, and now they’re crossing their fingers in production.

What evals actually are

A structured way to measure how your AI system behaves across a range of inputs — including inputs you didn’t think of when you built it.

Here’s how I run it in my GenAI training. We build a recipe chatbot — simple enough to complete in a day, complex enough to be interesting. Before anyone declares it “working,” we define simulated user personas and scenarios: the beginner cook, the person with dietary restrictions, the one who just wants to know what to make with whatever’s in the fridge. From those personas we generate synthetic queries — hundreds of them, automatically.

Then we evaluate manually. In almost every cohort the same thing happens: the chatbot handles straightforward requests fine, but the synthetic queries expose something nobody anticipated. Most participants modeled a single recipe in their data model — clean, simple, made sense at the time. But real users don’t ask for one recipe. They ask for a meal plan. They ask for a shopping list. They ask “what can I make this week that’s quick and vegetarian.” The data model was too restrictive for how people actually use a recipe assistant, and no amount of manual demo testing would have surfaced it.

A mismatch between what was built and what was actually needed. Found in an hour. In production, finding the same thing takes days — if you’re looking at all.

Once the manual eval is solid, we build an LLM-as-a-judge: an automated evaluator aligned to the manual scores. Multi-axis binary — separate yes/no judgments on each dimension. Did it answer the question? Was the recipe appropriate for the dietary restriction? Was the response the right length? A single score hides where things break. Multiple axes tell you which dimension is failing.

What evals actually solve

Once you have an eval suite, a lot of problems that felt intractable become straightforward.

Reliability. Does it work consistently, not just in the demo? With evals you have a number. 94% on your test set. 87% on edge cases. You know your floor, you know where it wobbles, and you can decide whether that’s good enough to ship.

Failure modes. Most teams find out about failure modes from support tickets. Evals move that discovery earlier, when it’s cheap to fix rather than after it’s already a reputation problem.

Capabilities. What it can actually do versus what you assume it can do. The recipe chatbot finding above is a capabilities finding — the system was capable of less than assumed, and the assumption was embedded in the data model.

Non-determinism. Run the same query a hundred times and measure the variance. Now it’s a number. Is 92% consistency good enough for your use case? Probably yes for a recipe assistant. Probably no for a medical triage tool. An engineering decision, not an anxiety.

Cost. Once you have evals, you can run them against cheaper models and find exactly where the cheaper model degrades. Maybe Haiku handles 80% of your queries just as well as Opus — but fails on the complex multi-step ones. Route intelligently and cut your bill without touching quality where it matters.

New model testing. Gemini 3.5 Flash came out last week. Should you switch? Run your eval suite against it. An hour later you have an answer specific to your use case, not a benchmark written by the model’s own creators.

Why most teams skip evals

The prototype worked fine. The demo went well. The stakeholders are happy. Adding evals feels like work on top of something that already seems to be working — and nobody asked for it in the spec.

The deeper problem is that unlike a bug, the absence of evals doesn’t break anything visibly. A missing semicolon fails loudly. An AI feature that works 78% of the time ships quietly, gets used quietly, and fails quietly — one bad response at a time, in front of real users, with no stack trace to point at.

There’s also a sequencing trap. Teams tell themselves they’ll add evals later, once the feature is more stable. But “more stable” never quite arrives, and by then the feature is in production, the synthetic data doesn’t exist, and building evals retroactively is harder than doing it upfront. The teams that get this right treat evals as part of the definition of done — same way you wouldn’t ship a backend endpoint without tests.

What a minimal eval setup looks like

Most teams imagine a months-long instrumentation project. A useful starting point has four parts:

A golden dataset. Fifty to a hundred representative inputs — real queries from users, or synthetic ones generated from personas. Not comprehensive, just representative.

Scoring criteria. For generative tasks, exact answers are too rigid. Define what “good” looks like on a few axes instead. Did it answer the question? Did it stay within scope? Was the tone appropriate? Binary per axis.

A way to run it. A script that feeds each input to the model and collects outputs. Keep it simple.

A score you can track over time. A single number — weighted average across your axes — that you can compare before and after a prompt change, a model update, or a new feature. If the number goes down, something regressed.

You can have this running in a day. Everything after — LLM-as-judge, automated pipelines, regression alerts — is an improvement on something that already works, not a prerequisite.

Evals aren’t glamorous, but shipping an AI feature without them means you don’t actually know what you built.

If your team is at that point — or you’re not sure where to start — hit reply. Depending on what you’re dealing with, it might be a conversation, a workshop, or something in between.

Your AI feature works. Prove it.