Two problems blocking autonomous AI coding assistants

Published: Mar 30, 2026

So I’ve been using AI coding assistants daily for a while now. Claude Code, GitHub Copilot, Codex - I’ve tried them all. And I keep running into the same wall.

Every time I use agentic coding, I end up doing two things manually: First, I verify that the code actually does what I expected - running tests, checking behavior. Second, I read through the changes to understand how the shape of the codebase changed.

So by using AI agents, my job as a developer is reduced to writing unit tests and doing code reviews. You know, the two things developers love the most ;-)

These two manual steps eat into the productivity gains. And until we solve them, we can’t truly let the agents loose.

You’ve probably heard the optimistic takes. AI is just another step up in abstraction, like moving from assembly to C, or from C to Python. Each generation of abstraction has made us more productive by letting us think at higher levels while the machine handles the details.

But there’s a fundamental difference this time.

The determinism problem

This is why I have to manually verify everything the AI produces.

Every previous step in abstraction - from assembly to compilers to interpreters - maintained one crucial property: determinism. Compile the same C code twice with the same compiler and flags, and you get machine code that behaves the same. Run the same Python script twice on the same input, and you get the same output.

AI coding assistants break this rule. Ask an AI to implement a feature twice, and you’ll get two different implementations. Sometimes subtly different, sometimes radically different.

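This isn’t a bug. It’s fundamental to how these systems work: a language model samples its output from a probability distribution, so the same prompt can take a different path each time. And it’s why I can’t just trust the output - I have to verify it every single time.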

Solution space: Guardrails everywhere

So how do we reduce this manual verification burden?

The obvious answer is what we’ve always known works: comprehensive testing and guardrails. If I have good tests, I don’t have to manually verify behavior - I run the tests and they tell me if the AI broke something. Linting catches style violations. Type systems catch errors early.

The framework from Growing Object-Oriented Software, Guided by Tests by Steve Freeman and Nat Pryce - with its outer loop of acceptance tests (what many teams now do as BDD) and its inner loop of unit-level TDD - becomes more relevant than ever. We define the expected behavior clearly enough that the AI can’t stray too far from what we want.

But here’s the uncomfortable truth: if the solution to AI-assisted coding is “write more tests” - the thing developers notoriously avoid - then we have a problem. We need better answers than just “do the thing you hate, but more of it.”

Maybe the AI should help write the tests. Maybe we need new kinds of verification that don’t require as much manual effort. Maybe property-based testing or formal verification becomes more practical. Or maybe there’s a solution we haven’t thought of yet.
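To make the property-based idea concrete, here is a minimal sketch. Instead of pinning down one implementation with hand-picked examples, I state properties that any acceptable implementation must satisfy, and check them against many random inputs. Everything here is an assumption for illustration: slugify is a hypothetical function I might ask an agent to write, and the generator is hand-rolled with the standard library. A dedicated library such as Hypothesis would generate inputs and shrink failing cases for you.

```python
import random


def slugify(title: str) -> str:
    """Hypothetical function an agent was asked to implement."""
    # Replace every non-alphanumeric character with a space, then join words.
    words = "".join(c.lower() if c.isalnum() else " " for c in title).split()
    return "-".join(words)


def check_slugify_properties(runs: int = 500, seed: int = 0) -> int:
    """Check properties of slugify on random titles; returns the number of runs."""
    rng = random.Random(seed)  # fixed seed so a failure can be reproduced
    alphabet = "abcXYZ019 -_!?"
    for _ in range(runs):
        length = rng.randint(0, 30)
        title = "".join(rng.choice(alphabet) for _ in range(length))
        slug = slugify(title)
        # Property 1: only lowercase alphanumerics and hyphens survive.
        assert all(c.isalnum() or c == "-" for c in slug), (title, slug)
        assert slug == slug.lower(), (title, slug)
        # Property 2: no leading, trailing, or doubled hyphens.
        assert not slug.startswith("-") and not slug.endswith("-"), (title, slug)
        assert "--" not in slug, (title, slug)
        # Property 3: slugifying a slug changes nothing (idempotence).
        assert slugify(slug) == slug, (title, slug)
    return runs


check_slugify_properties()
```

The appeal for agentic coding is that the properties stay valid no matter which of the many possible implementations the AI produces. The test constrains the behavior, not the code.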

I don’t have the answer. I’m identifying the problem.

The cognitive load problem

But even with good tests and guardrails, I still have to read through what the AI produced. This is the second manual step - understanding how the shape of my codebase changed.

Humans change code a few lines here, a small refactoring there. AI doesn’t work that way: it can rewrite entire files, touch dozens of files in a single session, and generate hundreds of lines of code in seconds.

Looking at massive code diffs will fry your brain. You simply cannot keep it all in your head. And yet I need to understand what changed - did the AI introduce new dependencies? Did it change the architecture in ways I didn’t expect?

Traditional line-by-line diffs don’t scale for this. I need new ways to understand whether the change has the shape I expected.
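Some of these questions can be answered without reading the diff at all. As a rough sketch, assuming a Python codebase and assuming I can get the before and after contents of a file as strings, the standard ast module can tell me which top-level modules a change started importing. The function names and the sample sources are made up for illustration.

```python
import ast


def imported_modules(source: str) -> set[str]:
    """Return the top-level module names imported by a piece of Python source."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # level == 0 skips relative imports, which are not new dependencies
            modules.add(node.module.split(".")[0])
    return modules


def new_dependencies(before: str, after: str) -> set[str]:
    """Modules imported in the new version that the old version did not import."""
    return imported_modules(after) - imported_modules(before)


# Illustrative inputs: the same file before and after an agent session.
before = "import json\nfrom pathlib import Path\n"
after = (
    "import json\n"
    "import requests\n"
    "from pathlib import Path\n"
    "from yaml import safe_load\n"
)

print(sorted(new_dependencies(before, after)))  # ['requests', 'yaml']
```

This only parses the source, it never runs it, so it is safe to point at code I have not reviewed yet. It answers one narrow question, but it answers it in one line of output instead of a few hundred lines of diff.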

A new way to view changes

So how do we reduce this second burden - understanding what changed without reading every line?

What if we could view changes at different levels of abstraction? Not line-by-line diffs, but architectural changes.

Simon Brown’s C4 model - a way to visualize software architecture at four levels of abstraction (Context, Containers, Components, Code) - might offer an answer. Imagine seeing a visual diff of your architecture after the AI made changes. You could look at it and think: “That’s odd - I wouldn’t have expected changes in this component. I expected the changes to be over here.”

And then you could have a conversation with the AI about it. Ask why it touched that component. Understand the reasoning. Maybe it found a better approach. Maybe it misunderstood the task.

This pushes us toward systems thinking - I verify the structure and architecture, and let the AI handle the implementation details. I’ve thought about using a graph database to model code structure and track changes over time.
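Here is a small sketch of what such a structural diff could look like, using plain in-memory sets rather than a graph database. It rests on a few assumptions of mine: file-level dependencies are already extracted as (source, target) pairs, and a file’s component is simply its top-level directory. The file names are invented. Collapsing file edges into component edges and comparing before and after shows only the changes that cross component boundaries.

```python
from typing import Iterable

Edge = tuple[str, str]


def component_of(path: str) -> str:
    """Map a file to a component. Assumption: the top-level directory."""
    return path.split("/")[0]


def component_edges(file_edges: Iterable[Edge]) -> set[Edge]:
    """Collapse file-level dependencies into component-level ones."""
    edges = set()
    for src, dst in file_edges:
        a, b = component_of(src), component_of(dst)
        if a != b:  # dependencies inside one component are implementation detail
            edges.add((a, b))
    return edges


def architecture_diff(before: Iterable[Edge], after: Iterable[Edge]) -> dict[str, list[Edge]]:
    """Component-level dependencies that appeared or disappeared."""
    old, new = component_edges(before), component_edges(after)
    return {"added": sorted(new - old), "removed": sorted(old - new)}


# Illustrative data: dependencies before and after an agent session.
before = [
    ("api/routes.py", "billing/invoices.py"),
    ("billing/invoices.py", "storage/db.py"),
]
after = before + [
    ("api/routes.py", "storage/db.py"),        # api now talks to storage directly
    ("billing/tax.py", "billing/invoices.py"),  # internal to billing, ignored
]

print(architecture_diff(before, after))
# {'added': [('api', 'storage')], 'removed': []}
```

The new file inside billing never shows up, because it doesn’t change the shape of the system. The direct edge from api to storage does, and that is exactly the kind of change I would want to ask the AI about.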

But again - I don’t have the perfect solution. Maybe there’s a completely different approach we haven’t discovered yet.

Bottom line

Right now, these two manual steps eat into the productivity gains of AI coding assistants. Every time I use them, I’m verifying and reviewing.

To truly let the agents loose, we need to automate away these manual steps. More guardrails to handle the verification. Better tools to understand changes at an architectural level.

If we solve this, we unlock real productivity gains - thinking about systems and architecture while the AI handles implementation details.

What approaches are you using to make AI coding assistants more reliable? I’d love to hear about it.
