Your AI coding bill needs a plan

Don’t give the AI your credit card

Uber burned their entire 2026 AI budget. In Q1.

Their CTO went on record saying they’re “back to the drawing board” after a surge in Claude Code usage blew past every internal projection. Engineers were spending between $500 and $2,000 per person, per month. They gave thousands of engineers near-unlimited access to a powerful AI coding agent and watched a full year of budget vaporize in three months.

You might think: big company problem. Thousands of engineers, billions in R&D, someone forgot to set a limit.

But last month a developer at a 300-person company here in Oradea spent $1,000 on AI tokens in a single month. Their company now expects the GitHub Copilot bill to go up 3-4x following the recent pricing changes. Same problem, different scale, same absence of a plan.

This is the new normal. Eng tooling just became a salary-sized line item — and most teams don’t have a framework for thinking about it yet.

There are levers. Most teams aren’t using them.

The spend isn’t inevitable. But AI-assisted coding is very new, so most developers don’t have a good model of how to control costs. Here are some knobs worth knowing about.

Model selection. A year ago I told every team I worked with: always use the best model, it’s worth it. I’ve updated that position. Haiku is now good enough for some basic things. Sonnet handles most implementation work well. Opus earns its price on architecture decisions, hard debugging, and anything where the cost of getting it wrong is asymmetric, such as security work.

Context management. Every message re-sends your full session context. A bloated CLAUDE.md, a dozen MCP servers you installed six months ago and forgot about, plugins that haven’t fired in weeks — all of it loads on every message, before you’ve typed a word. When was the last time you ran an audit on what your agent loads in its context window?
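To make that overhead concrete, here is a minimal sketch of the arithmetic. The token sizes and the message count are made-up assumptions, not measurements, and the calculation ignores any prompt caching your provider may apply. Substitute numbers from your own setup.

```python
from dataclasses import dataclass


@dataclass
class ContextItem:
    name: str
    tokens: int  # assumed size; measure your own setup


def fixed_overhead_tokens(items: list[ContextItem], messages: int) -> int:
    """Tokens re-sent over a session just for always-loaded context."""
    per_message = sum(item.tokens for item in items)
    return per_message * messages


# Illustrative numbers only.
items = [
    ContextItem("CLAUDE.md", 3_000),
    ContextItem("MCP tool definitions", 12_000),
    ContextItem("plugins", 2_000),
]
session_overhead = fixed_overhead_tokens(items, messages=50)
print(session_overhead)  # 850000
```

Under these assumptions, a 50-message session spends 850,000 tokens on context that never changed, which is the kind of number an audit is meant to surface.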

Orchestration design. Knowing when to pause and review versus when to let it run is a skill, and most teams haven’t developed it deliberately. An agent-review loop can burn hundreds of thousands of tokens. Three parallel subagents each allowed to retry five times on failure is fifteen expensive inference calls for a task that should have been one. Sometimes a senior engineer reviewing it themselves costs dramatically less — and catches more.
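The subagent arithmetic is easy to check before you launch anything. This is a rough sketch of a worst-case bound. It follows the count used above (five attempts per subagent), and the `tokens_per_call` figure is an assumption for illustration, not a measured value.

```python
def worst_case_calls(subagents: int, attempts_each: int) -> int:
    """Upper bound on inference calls if every attempt is used."""
    return subagents * attempts_each


def worst_case_tokens(subagents: int, attempts_each: int, tokens_per_call: int) -> int:
    """Upper bound on tokens, assuming every call costs about the same."""
    return worst_case_calls(subagents, attempts_each) * tokens_per_call


calls = worst_case_calls(subagents=3, attempts_each=5)
tokens = worst_case_tokens(subagents=3, attempts_each=5, tokens_per_call=40_000)
print(calls)   # 15
print(tokens)  # 600000
```

If the worst case is more than the task is worth, that is a signal to lower the retry limit or put a human checkpoint in the loop.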

Scope discipline. Garbage in, garbage out — but more expensive. A vague prompt that sends an agent down the wrong path for twenty minutes is a cost problem as much as a quality problem. Spending half a (human) hour thinking hard about the task isn’t just good practice, it’s cost management as well.

The honest summary

The teams getting this right aren’t spending less — they’re spending intentionally. They know which workflows earn their cost and which ones are expensive procrastination. They have a rough model for what a task should cost before they run it, and they notice when the bill says otherwise.

Most teams don’t have that yet. And right now, without it, you’re either leaving productivity on the table or funding someone else’s Q1 budget story.

Is this a live problem for your team?

Depending on what you’re dealing with, it might be a conversation, a workshop, or something in between.

Local models are usually the wrong answer

The boring enterprise questions that actually matter

Vendor lock-in, legal review, and when local models actually make sense.

Last issue I broke down the four things enterprises should evaluate when choosing AI coding tools: the model, the harness, the infrastructure, and the payment model — subscription, token-based, or per-request.

That last category is already shifting.

GitHub Copilot recently announced plan changes that remove most of the per-request subsidies, which makes the point about staying nimble more urgent than I expected when I wrote it.

This issue covers the three questions that slow enterprises down the most:

  1. How bad is vendor lock-in really?
  2. What does security and legal vetting actually involve?
  3. When do local models make sense?

Lock-in — less scary than you think, but not zero

It’s worth addressing directly: the risk is real, but it’s much smaller than most procurement teams fear.

The tooling itself is largely portable. A CLAUDE.md is a text file. Hooks and custom configurations take an afternoon to rewrite, and an agent can do most of it for you.

If you decide to move your team from Claude Code to Codex tomorrow, you’re not facing a months-long migration. You’re facing days of friction and 1-2 weeks of your developers recalibrating their prompting habits.

Where lock-in actually bites is in the contract, not the tooling.

A two-year enterprise agreement signed today is a bet that today’s best tool is still the right choice in 2027. Given how fast this space moves, that’s a significant bet.

Keep contracts short and exit clauses explicit — that’s the real mitigation, not worrying about whether your CLAUDE.md will port cleanly.

There’s a subtler version of the switching cost that’s worth understanding though.

Even if you stay with the same vendor, model updates can disrupt your workflow. When Anthropic shipped a new Opus version recently, many teams found it required a noticeably different prompting style to get the same quality of results — more explicit instructions, more courteous framing.

Same vendor, same harness, meaningfully different behaviour.

This isn’t unique to Anthropic — every major model update from any vendor carries some version of this friction.

GPT-5.5 came with extra goblins included. Really, it likes to talk about goblins a lot more.

So don’t take the marketing numbers at face value when they say a model is better on some benchmarks. Expect some friction until your devs get used to a new model.

Security and legal vetting — get the enterprise agreement, but be honest about what you’re protecting

A friend of mine recently told me about a developer at a large enterprise — locked-down corporate laptop, strict IT policy, no AI tools approved yet.

His workaround: taking photos of his screen with his phone and sending them to ChatGPT. This is what over-restrictive AI policies actually produce. Not compliance — creative non-compliance that’s far harder to monitor and control than just giving people proper access.

The most common pattern I see is a milder version of the same thing: developers have been using AI coding tools for six months before legal or IT finds out.

They signed up with a personal email, accepted consumer terms without reading them, and have been pasting internal code into a chat interface ever since. This is worth fixing, but not for the reasons most legal teams will tell you.

Consumer terms and enterprise agreements are meaningfully different. On a consumer plan, your prompts and the code you share may be used to improve the model. Data residency is typically unspecified and your request goes wherever there’s spare capacity. There’s no DPA, no contractual guarantees, no audit trail.

Enterprise agreements fix most of this: no training on your data, contractual data residency options, proper data processing agreements that actually hold up under GDPR scrutiny.

So yes, pay for the enterprise tier.

Don’t let developers run unsupervised on consumer accounts. That’s the straightforward advice. But be honest about what you’re actually protecting. Private equity firms have started doing technical due diligence differently — a standard web app that took a team six months to build can be reproduced over a weekend with agents.

Many acquisition deals have quietly fallen through because the technical moat turned out to be shallower than anyone admitted.

Code, increasingly, is not a valuable asset.

What is worth protecting is your data — customer information, internal architecture decisions, unreleased product details — and the domain knowledge embedded in your prompts and workflows.

Those go into the model context and that’s what an enterprise agreement actually safeguards.

The right frame isn’t:

“lock everything down to protect our IP.”

It’s:

“get the proper agreement so your data isn’t training someone else’s model, then get out of the way and let your developers use the tools.”

Enterprises that spend six months in legal review while competitors ship with agents aren’t protecting anything — they’re just falling behind.

Local models — usually the wrong answer to the right question

I’ve had a client running local models on four Mac Minis stacked in a server rack. It worked, technically.

It was available to two developers. Everyone else was still on API-based tools.

This is roughly the state of most enterprise local model deployments in practice: a proof of concept that never quite scaled, maintained by whoever set it up, running a model that’s two generations behind the frontier.

That’s not an argument against local models. It’s an argument for being honest about when they actually make sense.

The legitimate use cases are narrower than vendors selling local model infrastructure will tell you. Air-gapped environments — defence, critical infrastructure — have no choice.

Certain regulated industries where data genuinely cannot leave the building under any circumstances. Companies with IP sensitive enough that even a well-drafted enterprise agreement feels like too much trust. These are real, but they describe a minority of the enterprises currently evaluating local models.

The capability gap is closing but it’s not closed.

Models like DeepSeek, Qwen and Codestral are genuinely impressive, and a year ago I wouldn’t have said that. But “genuinely impressive” and “as good as the frontier on a hard architectural problem” are still different things.

You’re trading capability for control, and that’s a legitimate trade — just make sure you’re making it consciously.

The hidden costs are also worth naming. Hardware is the obvious one.

Less obvious: you’re now responsible for updating the models yourself, running inference infrastructure, and absorbing the engineering overhead that a cloud API quietly handles for you.

Frontier model providers ship improvements constantly. On a local deployment you’re on your own cadence.

And there’s a longer-term risk that most local model strategies don’t account for. Open source models were released to build community and mindshare — and it worked spectacularly. Qwen alone accumulated nearly a billion downloads.

But once you have that mindshare, the incentive to keep releasing your best weights weakens.

Qwen’s flagship model is now API-only for the first time in the project’s history, while a capable but less powerful version remains open.

The frontier is closing, even among the labs that built their reputation on openness.

An enterprise strategy built around self-hosting the best available open model needs to reckon with the fact that “best available open model” may increasingly mean “second tier.”

For most enterprises the honest answer is this:

If you’re considering local models primarily because of data security concerns, an enterprise agreement with a proper DPA solves the same problem at a fraction of the cost and maintenance burden.

Local models are often a solution to a legal and procurement problem that has a cheaper answer. The exception worth considering is the hybrid approach: local for anything genuinely sensitive, API for everything else.

Classify your data, route accordingly. You get the security guarantees where they matter and frontier model capability everywhere else.
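As a rough illustration of "classify, then route", here is a minimal sketch. The path prefixes, the two sensitivity levels and the backend names `local` and `api` are all hypothetical. A real classifier would follow your own data policy rather than a prefix rule.

```python
from enum import Enum


class Sensitivity(Enum):
    INTERNAL = 1
    RESTRICTED = 2


# Hypothetical path prefixes; a real classifier would follow your data policy.
RESTRICTED_PREFIXES = ("customers/", "secrets/")


def classify(path: str) -> Sensitivity:
    """Label a file by path prefix (a deliberately crude rule)."""
    if path.startswith(RESTRICTED_PREFIXES):
        return Sensitivity.RESTRICTED
    return Sensitivity.INTERNAL


def choose_backend(paths: list[str]) -> str:
    """Route a request to 'local' if any attached file is restricted."""
    if any(classify(p) is Sensitivity.RESTRICTED for p in paths):
        return "local"
    return "api"


print(choose_backend(["src/app.py", "docs/readme.md"]))        # api
print(choose_backend(["src/app.py", "customers/export.csv"]))  # local
```

The design choice worth noting is that a request inherits the sensitivity of its most sensitive input, so one restricted file is enough to keep the whole request local.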

Before signing anything, three questions worth answering

Is your data actually sensitive enough to need a local model, or do you need a proper enterprise agreement?

Is this a two-year contract in a market that changes every six months?

Who internally owns the renewal decision — and do they know enough to make it?

This is also the kind of decision-making I help teams work through in AI-assisted development trainings: not just which tools to use, but how to adopt them without creating chaos.

Hit reply if you’re navigating any of this inside your team.

I read everything.

Choosing AI coding tools is not really about the tool

Choosing AI Coding Tools Without Regretting It Six Months Later

The AI coding space is moving at an uncomfortable pace.

Even as an AI consultant who tracks this full time, I can’t keep up with every tool that launches. Today’s best model is from Anthropic. Next week it might be OpenAI. The week after, Google surprises everyone.

This makes buying decisions genuinely hard for enterprises — you’re not buying stable software, you’re placing a bet on a moving target.

Here’s how I think about the decision.

1. The model

Anthropic has the Claude series, OpenAI has GPT, Google has Gemini. Each family gets meaningful updates every few months, and which one leads on any given benchmark shifts constantly.

More importantly, models aren’t uniformly good.

Some are better at generating new code, some at finding bugs, some at reasoning through long-running autonomous tasks. This usually reflects where that company focused its training efforts in the last cycle — which means the rankings shift as priorities shift.

Don’t pick a model based on a benchmark from three months ago.

2. The harness

The harness is how your team actually interacts with the model — an IDE integration, a terminal agent, a chat interface.

This matters more than most people realise.

Anthropic’s models have been specifically optimised for use inside Claude Code, and they perform measurably better there than when accessed through a generic wrapper. Other models are trained more broadly and don’t have a preferred harness.

Some harnesses give models access to more tools — file editing, terminal execution, web search — and this directly affects what they can accomplish on real tasks.

The practical implication: if your team builds workflows, hooks and institutional knowledge around a specific harness, that investment doesn’t transfer easily.

Lock-in at the harness level is often a bigger risk than lock-in at the model level.

3. Infrastructure and data residency

Where is the model running, and where is your data going?

Claude is available directly through the Anthropic API and through all major cloud providers. Gemini is Google-only. Some APIs let you specify that requests stay in Europe — important for GDPR compliance.

Others route to wherever spare capacity exists, with no guarantees. This is not a detail to sort out after you’ve signed a contract.

For regulated industries or anything involving sensitive data, data residency needs to be a first-order requirement, not an afterthought.

4. Payment model

This is the decision most teams get wrong, and it has long-term consequences. Subscription gives you predictable costs but unpredictable performance.

Providers have every incentive to quietly degrade quality during peak periods — you’ve already paid. Subscription pricing is also heavily subsidised right now. When that subsidy has to give way to sustainable unit economics, the price will look very different.

Per-request pricing, as used in GitHub Copilot, is conceptually tidy but practically broken. Requests vary enormously in complexity. Pricing them uniformly means either the provider loses money on hard tasks or you overpay on easy ones.

I don’t see this surviving long-term.

Token-based pricing — you pay for exactly what you use — is the most transparent and the most portable. It gives you access to any harness, any model, through APIs or aggregators like OpenRouter.

It’s also the most expensive at face value, though often cheaper than subscription once you account for actual usage patterns.
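Whether token-based pricing beats a subscription depends on actual usage, and that can be sketched in a few lines. Every price and volume below is a placeholder chosen for illustration, not any vendor’s actual rate. Plug in your own contract numbers and usage logs.

```python
def token_cost(input_tokens: int, output_tokens: int,
               input_price_per_mtok: float, output_price_per_mtok: float) -> float:
    """Monthly cost under token-based pricing (prices per million tokens)."""
    return (input_tokens / 1_000_000 * input_price_per_mtok
            + output_tokens / 1_000_000 * output_price_per_mtok)


def cheaper_plan(subscription_price: float, monthly_token_cost: float) -> str:
    """Name the cheaper option for one developer in one month."""
    return "subscription" if subscription_price < monthly_token_cost else "token-based"


# Placeholder prices: 3 per million input tokens, 15 per million output tokens,
# and a flat subscription of 100 per month.
heavy = token_cost(40_000_000, 4_000_000, 3.0, 15.0)
light = token_cost(5_000_000, 500_000, 3.0, 15.0)

print(heavy, cheaper_plan(100.0, heavy))  # 180.0 subscription
print(light, cheaper_plan(100.0, light))  # 22.5 token-based
```

With these made-up numbers the heavy user is better off on a subscription and the light user on tokens, which is why the comparison has to be run per usage pattern rather than once per company.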

The practical advice

Don’t make a five-year platform decision in a market that changes every five months.

Run experiments across different teams, don’t sign contracts that are hard to exit, and plan explicitly to revisit the decision every six months. Build that review cadence into the rollout, not as an afterthought.

Next issue I’ll cover the questions this one doesn’t answer: security and legal vetting, the lock-in risk in more depth, and when local models actually make sense for enterprise teams.

If you want to go deeper than tool selection, I also run training programs for development teams on AI-assisted coding, agentic workflows, and how to actually integrate these tools into day-to-day engineering work.

The goal isn’t to chase every new model launch — it’s to help teams build a practical workflow they can trust.

If your team is currently evaluating AI coding tools, hit reply and tell me where the decision is getting stuck — model choice, security, cost, or developer adoption.

How I Write Software with LLMs

Over the last year, I’ve written more than 100,000 lines of code using AI. I’ve landed on a workflow I’m genuinely happy with — both in how it feels to use and in the quality of the resulting code.

Most people I see either:

  • give a vague prompt, get a disappointing result, and give up
  • or go the other direction and build complex orchestration pipelines with a dozen moving parts that are too unreliable to trust

This is what works for me in between those two extremes.

The tools

My main driver is Claude Code with Opus, unless the task is small (roughly under 100 lines), in which case Sonnet is fine.

For a second opinion — and for certain tasks — I use OpenAI’s Codex on GPT-5.4 at xhigh reasoning.

Using two models deliberately isn’t redundancy.
They have different “personalities” and catch different things.

Start with clarification, not code

Whenever I start a new session, I describe what I want — the rough feature outline, what’s in my head — and then I explicitly tell the model to ask me questions about anything unclear.

This step is non-negotiable.

Skip it, and the model will silently make assumptions at every ambiguous decision.
Run it again, and it will make different assumptions.

You’ll get code that works — but isn’t quite what you wanted — and you won’t immediately know why.

I iterate on this several times, asking for more questions until the specification is nailed down.

The more precise the spec, the more mechanical the code generation becomes. That’s the goal.

Planning large features

For anything substantial — a new section of the app, a significant feature — I use plan mode.

The model:

  • reads through the codebase
  • identifies existing patterns
  • produces a detailed plan before writing a single line of code

Claude’s plans tend to be very explicit: endpoints, data shapes, what gets touched and why.

Once I have that plan, I pass it to Codex and ask it to critique.

Codex is more nitpicky and tends to catch smaller issues Claude glosses over.

I take those suggestions back to Claude, do a few revisions, and only once everything checks out does Claude write the actual code.

Reviewing the output

I don’t read every line — that’s not realistic at scale.

Instead, I take a high-level pass:

  • which files were touched
  • what dependencies were introduced
  • whether existing code was reused

This alone catches a surprising number of mistakes.

If anything looks off, I flag it immediately.

Then I send the code to a different model for review: if Opus wrote it, Codex reviews it.

Or I open a pull request on GitHub and let the existing review tools do their pass.

Debugging

For gnarly bugs — not new features, but things that are genuinely broken — GPT-5.4 on xhigh is my first call.

It’s persistent in trying different approaches and has a high hit rate on hard problems.

The part most people skip

All of this only works because of the upfront specification work.

The advantage is that you can stress-test a spec from every angle before writing any code:

  • ask the model to rewrite it from a different perspective
  • challenge your assumptions
  • find edge cases

This costs almost nothing.

And once the spec is solid, turning it into working code becomes almost trivial by comparison.

Why I don’t automate everything

I could automate more:

  • trigger reviews automatically
  • chain agents together
  • remove myself from the loop

I’ve chosen not to.

Manual review checkpoints mean I still understand what’s being built.

That matters — both as an engineer and when I’m explaining these workflows to teams.

What does your workflow look like?

I’m especially curious whether anyone has found a reliable way to handle the spec phase without going back and forth as many times as I do.

What to look at when choosing AI tools for your team

The AI coding space is moving at an uncomfortable pace. Even as an AI consultant who tracks this full time, I can’t keep up with every tool that launches. Today’s best model is from Anthropic. Next week it might be OpenAI. The week after, Google surprises everyone.

This is a challenge for large companies, because they are used to more stability. But here are some things they have to look at when making decisions in this space:

1. The model

Anthropic has the Claude series, OpenAI has GPT, Google has Gemini. Each family gets meaningful updates every few months, and which one leads on any given benchmark shifts constantly.

More importantly, models aren’t uniformly good. Some are better at generating new code, some at finding bugs, some at reasoning through long-running autonomous tasks. This usually reflects where that company focused its training efforts in the last cycle — which means the rankings shift as priorities shift.

Don’t pick a model based on a benchmark from three months ago.

2. The harness

The harness is how your team actually interacts with the model — an IDE integration, a terminal agent, a chat interface. This matters more than most people realise.

Anthropic’s models have been specifically optimised for use inside Claude Code, and they perform measurably better there than when accessed through a generic wrapper. Other models are trained more broadly and don’t have a preferred harness. Some harnesses give models access to more tools (file editing, terminal execution, web search), and this directly affects what they can accomplish on real tasks.

The practical implication: if your team builds workflows, hooks and institutional knowledge around a specific harness, that investment doesn’t transfer easily. Lock-in at the harness level is just as big a risk as lock-in at the model level.

3. Infrastructure and data residency

Where is the model running, and where is your data going?

Claude is available directly through the Anthropic API and through all major cloud providers. Gemini is Google-only. Some APIs let you specify that requests stay in Europe — important for GDPR compliance. Others route to wherever spare capacity exists, with no guarantees.

For regulated industries or anything involving sensitive data, this is the first requirement that needs to be met, not an afterthought.

4. Payment model

There are three main payment models:

Subscription gives you predictable costs but unpredictable performance. Providers have a perverse incentive to quietly degrade quality during peak periods, either by serving a quantized model, or by changing the default thinking budget. Subscription pricing is also heavily subsidised right now. When that subsidy has to give way to sustainable unit economics, the price will look very different.

Per-request pricing, as used in GitHub Copilot, is conceptually tidy but practically broken. Requests vary enormously in complexity. Pricing them uniformly means either the provider loses money on hard tasks or you overpay on easy ones. I don’t see this surviving long-term.

Token-based pricing — you pay for exactly what you use — is the most transparent and the most portable. It gives you access to any harness, any model, through APIs or aggregators like OpenRouter. It’s also the most expensive one (Uber went through their whole budget for 2026 in just Q1), but many companies find that it still gives them a very good ROI.

The practical advice

Don’t make a five-year platform decision in a market that changes every five months. Run experiments across different teams, don’t sign contracts that are hard to exit, and plan explicitly to revisit the decision every six months. Build that review cadence into the rollout, not as an afterthought.

Next issue I’ll cover the questions this one doesn’t answer: security and legal vetting, the lock-in risk in more depth, and when local models actually make sense for enterprise teams.

Is your team navigating any of these decisions right now? Hit reply — I’m curious what’s causing the most friction.

Most people use AI coding tools the wrong way

How I Write Software with LLMs

Over the last year I’ve written more than 100,000 lines of code using AI. I’ve landed on a workflow I’m genuinely happy with — both in how it feels to use and in the quality of the resulting code.

Most people I see either give a vague prompt, get a disappointing result, and give up — or go the other direction and build complex orchestration pipelines with a dozen moving parts that are too unreliable to trust. This is what works for me in between those two extremes.

The tools

My main driver is Claude Code with Opus, unless the task is small (roughly under 100 lines), in which case Sonnet is fine. For a second opinion and for certain tasks, I use OpenAI’s Codex on GPT-5.4 at xhigh reasoning.

Using two models deliberately isn’t redundancy — they have different personalities and catch different things.

Start with clarification, not code

Whenever I start a new session, I describe what I want — the rough feature outline, what’s in my head — and then I explicitly tell the model to ask me questions about anything unclear.

This step is non-negotiable. Skip it and the model will silently make a guess at every ambiguous decision. Run it again and it’ll make different guesses. You’ll get code that works but isn’t quite what you wanted, and you won’t immediately know why.

I iterate on this — asking for more questions several times — until both of us feel like the specification is nailed down. The more precise the spec, the more mechanical the code generation becomes. That’s the goal.

Planning large features

For anything substantial — a new section of the app, a significant new feature — I use plan mode. The model reads through the codebase, identifies existing patterns, and produces a detailed plan before writing a single line of code. Claude’s plans tend to be very explicit: endpoints, data shapes, what gets touched and why.

Once I have that plan, I pass it to Codex and ask it to critique. Codex is more nitpicky and tends to catch smaller issues Claude glosses over. I take those suggestions back to Claude, do a few revisions, and only once everyone agrees does Claude write the actual code.

Reviewing the output

I don’t read every line — that’s not realistic at scale. Instead I take a high-level pass at which files were touched. This alone catches a surprising number of mistakes: code that didn’t reuse something already written, or changes that introduced cross-module dependencies that shouldn’t exist.

If anything looks off, I flag it immediately. Then I send the code to a different model for review — if Opus wrote it, Codex reviews it. Or I open a pull request on GitHub and let whatever review tool is set up on that project do its pass.

Debugging

For gnarly bugs — not new features, but things that are genuinely broken and require real digging — GPT-5.4 on xhigh is my first call. It’s persistent in trying different approaches and has a noticeably high hit rate on hard problems.

The part most people skip

All of this only works because of the upfront specification work. The cool thing is that you can stress-test a spec from every angle before writing any code — ask the model to rewrite it from a different perspective, challenge your assumptions, find edge cases. This costs almost nothing. And once the spec is solid, turning it into working code is almost trivially easy by comparison.

I could automate more — trigger reviews automatically, chain agents together, remove myself from the loop. I’ve chosen not to. Manual review checkpoints mean I still understand what’s being built. That matters to me both as an engineer and when I’m explaining these workflows to teams.

What does your workflow look like? Reply and let me know — I’m especially curious whether anyone has found a reliable way to do the spec phase without going back and forth as many times as I do.

Most engineers won’t like what’s coming

What will the future software engineer do?

Writing code has always taken up a surprisingly small share of a software engineer’s time. Maybe as a junior you’d spend the majority of your time writing code, but for senior people, only about 30% of the time went to actually typing into an editor. The rest was spent on architecture work, code reviews, mentoring and some testing.

Maybe this mismatch between expectations and reality is why software engineering degrees have some of the highest rates of attrition among technical degrees. When you start coding, you get dopamine hits from writing all those lines of code, but then you do it less and less.

With agentic AI for coding, this number will go down even more. I’m already hearing of teams where 90% of code is written by AI. In 10 years, less code will be written by humans than is hand-written in assembly today.

Some of the time this frees up will go to tasks that until now took too much effort and so were done rarely: cleanups, performance tests, increasing test coverage, better automation and so on. This is the fun part: you get to do more proper engineering.

But a lot of the time will go towards QA-like activities. Previously, when writing code, you ran it often and got a lot of feedback on whether it worked. Now you just get a notification that the agent has finished and the code is ready for review. Carefully reading thousands of lines of code is impossible, so most people skim it for blatant mistakes. And then you do lots of testing. Does this program actually behave the way you thought it should?

And many times, you will realize you underspecified the prompt and you need to go back and clarify things. Other times, by testing it you realize you actually need to do something else. Sometimes the existing code can be modified, but sometimes this means throwing it away and starting over.

And this is a very different job than what people signed up for years ago. If you get your dopamine hit from solving problems, it’s still good. I’ve talked to engineering managers who tell me that they are finally back in “coding” because of agents and they are really enjoying it. But if the dopamine hit was from writing code and wrangling compiler errors, people will not like this.

Which are you — the problem-solving dopamine hit, or the writing-code dopamine hit? Does using agents increase your enjoyment of the job or is it a source of frustration?

GitHub Copilot vs Cursor vs Claude Code — what actually works?

GitHub Copilot vs Cursor vs…

GitHub Copilot was the first tool to use AI to help with coding, back in the smart autocomplete era. It took me a while to start using it — how can a machine write code better than me? But after a friend strongly recommended it, I fell in love with it.

Then competitors started appearing. GitHub kept adding new features to Copilot, but spread itself too thin: too many weird features (look for the sparkle button in random places and try to guess what it does). As a result, Copilot didn’t do any one thing particularly well. It also took them a looong time to add a CLI interface.

Aider was one of those early tools. It was the first CLI AI tool. It was clunky, since models were not as good back then. But it showed the first signs of being able to do autonomous edits. It got things working, but the code quality wasn’t the best. And being first in a domain comes with a cost: they were stuck with architectural choices that didn’t scale well, eventually lost relevance, and development stopped.

At some point, Cursor became the highly favored startup. It works really well — but it required using their IDE. Back then, AI agents were not good enough to replace an IDE, and for Python, nothing beats PyCharm.

Then Claude Code appeared. I remember blowing $15 in a couple of hours using it. Initially, it felt like a slot machine — will I get good code this time? But what kept me using Claude Code is the strong integration between the model and the environment. Anthropic builds both, so Claude models work best there, because they’re trained to use it. The same model in another app performs slightly worse.

I also use Codex (mostly when I run out of my Claude subscription). It’s pretty good, but it has a colder personality, so I don’t enjoy it as much.

I’ve tried Google’s tools too. First off, Gemini has a somewhat depressive personality (if it keeps failing at a task, it might say it will delete itself — https://www.forbes.com/sites/lesliekatz/2025/08/08/google-fixing-bug-that-makes-gemini-ai-call-itself-disgrace-to-planet/). And Antigravity (their IDE) is an example of poor vibe coding. Unusable.

What about you — which AI coding tools have you actually found useful so far?
If this was helpful, feel free to share it with someone who’s exploring this space.

Optimizing performance in Qdrant

A while ago someone asked me some questions about Qdrant and how to optimize its usage for a use case where they had separate document sets for each “client”. When doing searches, they wanted to search only the documents belonging to the client doing the search.

One of the things that we discussed was whether it’s better to have a single collection for all documents and use a field “client_id” for filtering the results, or to use a separate collection for each client.

So I wrote a quick benchmark for this (with my friend Claude of course) to compare these scenarios. For the single collection, I tested both without an index and with a keyword index on the “client_id” field.
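To make the two layouts concrete, here is a minimal sketch in plain Python. It is not the benchmark code and does not use Qdrant's API: the brute-force `top_k` search, the toy vector size, and all function names are assumptions made for illustration. It only shows what "filter by client_id in one collection" versus "one collection per client" means, and how a mean and standard deviation of search time could be measured.

```python
import random
import statistics
import time
from collections import defaultdict

DIM = 32  # toy vector size (assumption, not the benchmark's real dimension)


def random_vector(rng):
    return [rng.gauss(0.0, 1.0) for _ in range(DIM)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def top_k(points, query, k):
    """Brute-force search: score every point and keep the k best."""
    return sorted(points, key=lambda p: dot(p["vector"], query), reverse=True)[:k]


def search_single(collection, query, client_id, k=5):
    """Layout A: one shared collection, filtered by client_id at query time."""
    candidates = [p for p in collection if p["client_id"] == client_id]
    return top_k(candidates, query, k)


def search_separate(collections, query, client_id, k=5):
    """Layout B: one collection per client, so no filter is needed."""
    return top_k(collections[client_id], query, k)


def time_searches(search, runs=20):
    """Return (mean, standard deviation) of the search time in milliseconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        search()
        timings.append((time.perf_counter() - start) * 1000)
    return statistics.mean(timings), statistics.stdev(timings)


rng = random.Random(0)
# 2000 points spread evenly over 20 hypothetical clients.
single = [
    {"id": i, "client_id": f"client_{i % 20}", "vector": random_vector(rng)}
    for i in range(2000)
]
separate = defaultdict(list)
for point in single:
    separate[point["client_id"]].append(point)

query = random_vector(rng)
hits_a = search_single(single, query, "client_3")
hits_b = search_separate(separate, query, "client_3")

# Both layouts return the same points; only the work done per query differs.
print([p["id"] for p in hits_a] == [p["id"] for p in hits_b])
print(time_searches(lambda: search_single(single, query, "client_3")))
print(time_searches(lambda: search_separate(separate, query, "client_3")))
```

The timings from this toy say nothing about Qdrant itself, which uses an approximate index rather than a full scan. The sketch is only meant to show the shape of the comparison.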

Configuration                        Search Time (ms)    Standard Deviation (ms)
Separate Collections                 3.8                 4.3
Single Collection (no index)         35.8                4.8
Single Collection (keyword index)    31.6                8.2

Turns out using separate collections is much faster, and this holds even for much larger numbers of users (I tested up to 2000).

Why could that be?

  • No Filtering Overhead: When using separate collections, there’s no need to filter results by client_id, because you’re already querying the correct subset of data.
  • Smaller Search Space: Each collection contains only the vectors for a single user, so Qdrant needs to scan through less data during the search.
  • Better Cache Utilization: With separate collections, the index structures for frequently accessed users are more likely to stay in memory.

Of course this isn’t a very comprehensive benchmark. There are many other options you can try out, such as the quite recently introduced tenant index. And having things in a single collection has some other advantages, including some operational ones. Reach out to me if you need help with managing Qdrant for your RAG use cases.

You can find the code to reproduce this in this repo.

Private and secure alternatives to ChatGPT

Everyone is hyping up GPT-4, and it’s true that it’s currently the best publicly available model. However, numerous open-source models are available that, when well utilized, can perform impressively using significantly fewer resources than GPT-4 (which is actually rumored to be a combination of eight models).

Recently, I completed a ‘talk to your document’ project for a client. There’s no shortage of startups doing this, but this client had an extra security and privacy requirement – their data could not leave their network. Thus, all processing had to happen on-premise. I informed them upfront that the inability to use GPT-4 might result in less accurate results, but they were willing to make that trade-off.

To my surprise, some open-source models proved to be extremely effective for this use case. Specifically, I created the embeddings using DistilBERT models trained on the MS Marco dataset, with FastChat-T5 as a Language Model (LM) for formulating answers.
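The overall shape of such a pipeline (embed the documents, retrieve the passages closest to the question, hand them to the language model as context) can be sketched as below. This is an illustration under loud assumptions, not the client project's code: `embed` is a bag-of-words stand-in rather than a DistilBERT model, the passages are made up, and the final prompt would be passed to whichever on-premise model is in use (FastChat-T5 in the setup described above).

```python
import math
import re
from collections import Counter


def embed(text):
    """Stand-in embedding: bag-of-words counts (NOT a DistilBERT model)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, passages, k=2):
    """Return the k passages most similar to the question."""
    q = embed(question)
    return sorted(passages, key=lambda p: cosine(q, embed(p)), reverse=True)[:k]


def build_prompt(question, context):
    """Assemble the text that would be sent to the on-premise language model."""
    joined = "\n".join(f"- {c}" for c in context)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{joined}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Made-up passages standing in for the client's documents.
passages = [
    "Invoices are archived for seven years in the finance share.",
    "VPN access requires a hardware token issued by IT.",
    "The cafeteria is open from 8:00 to 15:00 on weekdays.",
]
question = "How long are invoices archived?"
context = retrieve(question, passages, k=1)
prompt = build_prompt(question, context)
print(prompt)
```

Everything here runs locally, which is the point of the design: neither the documents nor the question leave the machine. In a real system the stand-in `embed` would be swapped for a proper embedding model and the passages stored in a vector database.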

The resulting system performed exceptionally well. The client was delighted with the performance, and importantly, the entire setup remained on-premise with no data leaving their infrastructure. Also, I was very pleasantly surprised by FastChat-T5, which is a 3-billion-parameter model but still answers very coherently, while being fast enough to run on a (beefy) CPU-only instance!

While GPT-4 is a remarkable model, there are various viable alternatives for companies with high security requirements. Despite different trade-offs, these models can still provide excellent performance across a variety of tasks, and I can help you navigate those trade-offs.

Reach out to me if you would like to have a private and secure “talk to your document” style app for your company!