Bamse AI Committee: what we found in 2026

Bamse's AI Committee is a cross-team group that exists for a simple reason: we wanted to stop guessing and start measuring what actually changes when you put coding agents inside a real engineering workflow. Not in a demo, not on a throwaway branch - inside a product that has clients, deadlines, and people reviewing PRs. This text is a raw snapshot of what we learned in the first months. Some things confirmed the obvious. Others caught us by surprise. And plenty is still open.

I'll be honest about what didn't work too, because there's no point in writing one more hype piece. We got plenty wrong along the way, and the mistakes taught us more than the wins.

What became obvious

The first lesson was almost disappointing in how simple it was: the harness and the context matter more than the choice of model. We came in thinking every new model release would be a leap. It wasn't. What moves the needle is what you put around the model - the project rules, the available tools, the checkpoints, the right context at the right time. Swapping models gives you a marginal gain. Fixing the harness gives you a structural one. A lot of what I wrote afterward on context engineering and harness engineering came straight out of these meetings.

The second lesson: agents are excellent at well-scoped, verifiable tasks, and dangerous on ambiguous ones. When the starting state is clear, the ending state is clear, and there's a mechanical way to check whether it worked - a migration, a repetitive refactor, test coverage, glue code - the agent flies. When the task requires an architecture decision, or the product spec is still moving, the agent produces a confident, wrong PR, and you spend more time undoing it than you'd have spent doing it by hand. The yardstick we adopted: if I can't describe how to verify the result, it isn't an agent task yet.

The third was the most counterintuitive for anyone not in the trenches: review discipline and guardrails matter MORE with agents, not less. The agent increases the volume of change entering the queue. If the review bar stays the same, the bottleneck just moves - out of the code and onto the human who has to read all of it. We felt this firsthand: agent-generated PRs waited far longer for review than hand-written ones. The answer wasn't more trust, it was more protection: tests as gatekeepers, hooks that run on their own, a human checkpoint at the risky points, small PRs instead of giant diffs.
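Of those guardrails, "small PRs instead of giant diffs" is the easiest one to enforce mechanically. Here is a minimal sketch of a size gate, assuming the input is the text produced by `git diff --numstat`. The 400-line budget and the function names are hypothetical, picked for illustration, not what we actually run.

```python
from typing import List, Tuple

MAX_CHANGED_LINES = 400  # hypothetical budget; tune per repository


def parse_numstat(numstat: str) -> List[Tuple[int, int, str]]:
    """Parse `git diff --numstat` output into (added, deleted, path) rows."""
    rows = []
    for line in numstat.strip().splitlines():
        added, deleted, path = line.split("\t", 2)
        # Binary files show "-" for both counts; count them as zero lines.
        rows.append((
            0 if added == "-" else int(added),
            0 if deleted == "-" else int(deleted),
            path,
        ))
    return rows


def pr_within_budget(numstat: str, limit: int = MAX_CHANGED_LINES) -> bool:
    """Return True when added plus deleted lines fit within the budget."""
    total = sum(added + deleted for added, deleted, _ in parse_numstat(numstat))
    return total <= limit
```

A CI step could feed it the numstat of the PR branch against the base branch and fail the build when it returns False, which pushes the split into smaller PRs before a human ever opens the diff.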

What surprised us

The good surprise was where the value came from. We were betting on model upgrades. What actually paid off was the boring-looking stuff: well-written skills, hooks that automate the tedium, and curated context. A skill that encodes "how we do X in this repository" pays off every time someone - or some agent - needs to do X. A hook that runs lint and tests after every edit is worth more than ten paragraphs of prompt begging "please don't break anything". The gain didn't come from the model getting smarter. It came from us getting more organized.
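To make the hook idea concrete, here is a rough sketch of a script an agent harness could call after every edit. It assumes the harness can run an arbitrary command and treats a non-zero exit code as "reject the edit"; that contract, and the `ruff` and `pytest` commands in `DEFAULT_CHECKS`, are assumptions for illustration rather than a description of any specific tool's hook API.

```python
import subprocess
import sys
from typing import List, Sequence

# Hypothetical check commands; swap in the linter and test runner the repo uses.
DEFAULT_CHECKS: List[List[str]] = [
    ["ruff", "check", "."],
    ["pytest", "-q"],
]


def run_checks(checks: Sequence[Sequence[str]]) -> List[str]:
    """Run each check command and return the failed ones as strings."""
    failed = []
    for cmd in checks:
        try:
            result = subprocess.run(list(cmd), capture_output=True, text=True)
            ok = result.returncode == 0
        except OSError:
            # A missing or unrunnable tool counts as a failure, not a silent pass.
            ok = False
        if not ok:
            failed.append(" ".join(cmd))
    return failed


def main() -> int:
    """Report failed checks on stderr and return a process exit code."""
    failures = run_checks(DEFAULT_CHECKS)
    for name in failures:
        print(f"check failed: {name}", file=sys.stderr)
    return 1 if failures else 0
```

The hook script itself would end with `sys.exit(main())`. The point of the design is that the check runs whether or not anyone remembered to ask for it.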

The bad surprise was that adoption is a people and process problem as much as a technology one. Handing out licenses isn't enough. There's the trust question - a senior dev who doesn't trust the agent ends up redoing everything by hand and the gain drops to zero. There's onboarding - someone who has never used an agent doesn't know how to scope a task for it, and the early frustration runs high. And there's the annoying question of who owns what: who maintains the prompts, who writes the skills, who reviews the context when it goes stale. Without someone wearing that hat, the tooling rots fast.

Rule of thumb

The time you save on boilerplate comes back as time spent on review. The balance only goes positive if you invest in the guardrails first, not after.

What is still open

There are three questions we still haven't closed. The first is measuring real productivity. Time-to-PR did drop, yes, but an open PR isn't delivered value. If the PR sits longer in the review queue, or bounces back for rework, the end-to-end gain disappears. We still don't have a metric we trust enough to put on a slide without laughing.
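To show why time-to-PR alone misleads, here is a small sketch that reports it next to time in review and end-to-end lead time. The `ChangeRecord` fields are hypothetical, and merge time stands in for "delivered", which is itself an approximation. This is not the metric we trust, just the minimum comparison we think any candidate metric has to survive.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Dict, List


@dataclass
class ChangeRecord:
    started: datetime    # work on the task began
    pr_opened: datetime  # PR was opened
    merged: datetime     # PR was merged (stand-in for delivery)
    rework_rounds: int   # times the PR bounced back from review

    @property
    def time_to_pr(self) -> timedelta:
        return self.pr_opened - self.started

    @property
    def time_in_review(self) -> timedelta:
        return self.merged - self.pr_opened

    @property
    def lead_time(self) -> timedelta:
        return self.merged - self.started


def median_hours(values: List[timedelta]) -> float:
    """Median of a list of durations, expressed in hours."""
    return median(v.total_seconds() / 3600 for v in values)


def summarize(records: List[ChangeRecord]) -> Dict[str, float]:
    """Median stage durations in hours plus the share of PRs that needed rework."""
    return {
        "time_to_pr_h": median_hours([r.time_to_pr for r in records]),
        "time_in_review_h": median_hours([r.time_in_review for r in records]),
        "lead_time_h": median_hours([r.lead_time for r in records]),
        "rework_rate": sum(1 for r in records if r.rework_rounds > 0) / len(records),
    }
```

Comparing this summary for agent-generated and hand-written PRs makes the trade-off visible: a drop in `time_to_pr_h` only counts if `lead_time_h` drops with it.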

The second is security and permissions. An agent that writes code is also an agent that runs commands, reads secrets, opens files. The more autonomy, the more we need explicit limits - what it can touch, what needs approval, what it can never do. And there's the uncomfortable point that generated code tends to introduce more security holes than hand-written code, so the security review can't be the first thing cut in the name of speed.
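Those three tiers - can touch, needs approval, can never do - map naturally onto an ordered rule list. Below is a minimal sketch for file paths, using glob patterns where the first match wins and anything unmatched falls back to asking a human. The patterns in `RULES` are invented for illustration and are not our real policy.

```python
from enum import Enum
from fnmatch import fnmatchcase
from typing import List, Tuple


class Decision(Enum):
    ALLOW = "allow"  # the agent may touch this on its own
    ASK = "ask"      # a human has to approve first
    DENY = "deny"    # never, regardless of approval


# Hypothetical rules, checked in order; deny rules go first so they always win.
RULES: List[Tuple[str, Decision]] = [
    (".env*", Decision.DENY),
    ("*/.env*", Decision.DENY),
    ("secrets/*", Decision.DENY),
    ("migrations/*", Decision.ASK),
    ("src/*", Decision.ALLOW),
    ("tests/*", Decision.ALLOW),
]


def decide(path: str, rules: List[Tuple[str, Decision]] = RULES) -> Decision:
    """Return the decision of the first matching rule for a repo-relative path."""
    for pattern, decision in rules:
        # fnmatchcase's "*" also matches "/", so "src/*" covers nested paths.
        if fnmatchcase(path, pattern):
            return decision
    # Unknown territory defaults to a human checkpoint, not to silent access.
    return Decision.ASK
```

The default matters more than any single rule: an agent that meets a path nobody thought about should stop and ask, not proceed.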

The third is knowledge ownership. When a skill encodes how the team solves a problem, that knowledge leaves people's heads and goes into the repository. That's great for onboarding and terrible if nobody maintains it. A stale skill is worse than no skill, because it radiates false confidence. We're still figuring out who should own that body of work and how often it needs to be pruned.
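Whatever the ownership model ends up being, staleness can at least be made visible. Here is a sketch that flags skills not reviewed within a given window, assuming each skill carries a last-reviewed date somewhere (in its metadata or in git history). The 90-day window and the function name are placeholders, since the right cadence is exactly what we haven't settled.

```python
from datetime import date, timedelta
from typing import Dict, List


def stale_skills(
    last_reviewed: Dict[str, date],
    today: date,
    max_age_days: int = 90,  # placeholder window, not a recommendation
) -> List[str]:
    """Return the names of skills last reviewed before the cutoff date, sorted."""
    cutoff = today - timedelta(days=max_age_days)
    return sorted(
        name for name, reviewed in last_reviewed.items() if reviewed < cutoff
    )
```

Run on a schedule, the output becomes a short list that someone has to either re-review or delete, which is a lot cheaper than discovering a stale skill through a bad PR.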

Where this leaves us

The honest summary of early 2026 is this: agents are worth it, but not for the reason we imagined. The gain isn't in the model, it's in the engineering around it. And that engineering - context, harness, skills, hooks, review - is senior work, not a magic button. We came in wanting to go faster and left understanding that the work moved, it didn't shrink. It's less typing and more deciding, specifying, and verifying. To me, that's good news.
