Measuring Developer Productivity with AI Coding Assistants: Throughput and Quality

AI coding assistants promised to make developers faster. Companies rushed to adopt GitHub Copilot, Amazon CodeWhisperer, and Tabnine, expecting a 50% boost in output. But two years in, the results are mixed. Some teams shipped features faster. Others got bogged down by buggy code, endless reviews, and confused engineers. The truth isn’t in the vendor claims; it’s in the data. And the data shows that throughput and quality don’t move together. When one goes up, the other often drops.

What You’re Measuring Matters More Than What You Think

Most teams start by counting lines of code or how often developers accept AI suggestions. That’s like measuring a chef’s skill by how many recipes they copy from a cookbook. You might get food on the table faster, but is it good? Is it repeatable? Will the customer come back?

A 2025 study by the METR Institute tracked 42 experienced developers working on real open-source projects. Half used AI assistants. Half didn’t. The AI group took 19% longer to complete tasks, even though they believed they were 20% faster. Why? Because the code they generated needed heavy fixes. The AI wrote code that looked right but missed edge cases, handled errors poorly, or leaned on undocumented assumptions. Developers spent more time reviewing, debugging, and rewriting than they saved writing.

Acceptance rate, the percentage of AI suggestions developers accept, is one of the worst metrics out there. GitLab found that teams with acceptance rates above 35% saw no improvement in delivery speed. Why? Because developers accepted suggestions they later had to completely redo. The AI didn’t save time. It just moved the work from typing to reviewing.

Throughput Isn’t Just About Speed

True throughput means delivering working software that customers can use. Not just code pushed to a repo. Not just pull requests merged. Real features that solve real problems.

Booking.com, after rolling out AI tools to 3,500 engineers, saw a 16% increase in throughput, but only after they changed how they measured it. They stopped tracking how fast developers wrote code. Instead, they tracked:

  • How many features actually reached customers per week
  • How long it took from feature request to live deployment
  • How often those features caused production incidents

That’s the difference between counting keystrokes and counting outcomes. AI helped them automate boilerplate code (login flows, API endpoints, test stubs). That freed up engineers to focus on logic, user flows, and edge cases. But only when they added guardrails: mandatory code reviews for AI-generated code, stricter testing requirements, and automated linting that flagged AI-specific patterns.
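
Tracking outcomes like these is mostly plumbing once the events are exported somewhere queryable. Here is a minimal sketch; the deployment and incident records, and their field names, are invented for illustration rather than taken from Booking.com’s systems.

```python
from datetime import datetime, timedelta

# Hypothetical records exported from deployment and incident tooling.
# Field names and dates are illustrative, not any vendor's actual schema.
deployments = [
    {"feature": "saved-searches", "requested": "2025-03-03", "live": "2025-03-12"},
    {"feature": "trip-reminders", "requested": "2025-03-05", "live": "2025-03-19"},
]
incidents = [
    {"feature": "trip-reminders", "opened": "2025-03-21", "severity": "high"},
]

def day(s: str) -> datetime:
    return datetime.strptime(s, "%Y-%m-%d")

# 1. Features that actually reached customers in a given week.
week_start = day("2025-03-10")
live_this_week = [
    d["feature"] for d in deployments
    if week_start <= day(d["live"]) < week_start + timedelta(days=7)
]

# 2. Lead time: feature request to live deployment, in days.
lead_times = [(day(d["live"]) - day(d["requested"])).days for d in deployments]

# 3. Incident rate: production incidents per shipped feature.
incident_rate = len(incidents) / len(deployments)

print(f"Features live this week: {live_this_week}")
print(f"Average lead time: {sum(lead_times) / len(lead_times):.1f} days")
print(f"Incidents per shipped feature: {incident_rate:.2f}")
```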

Quality Isn’t a Trade-Off. It’s the Foundation

A lot of teams think they have to choose: faster delivery or better code. They don’t. But they do have to measure both.

AWS’s CTS-SW framework calls this “tension metrics.” When AI speeds up one part of the pipeline, it creates pressure elsewhere. For example:

  • Product managers get flooded with half-baked feature ideas generated by AI, forcing them to spend more time clarifying requirements
  • Senior engineers become code reviewers for AI output instead of architects, slowing down mentorship
  • Security scans spike because AI-generated code often ignores authentication flows or uses deprecated libraries

Block, a company with 4,000 engineers, built an AI agent called “codename goose” to help teams. Their secret? They didn’t just track speed. They tracked understandability. They used static analysis tools to score how easy AI-generated code was for other developers to read and modify. If the code scored below a threshold, it triggered a mandatory pair-review. That cut long-term maintenance costs by 30%.
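
Block’s internal tooling isn’t public, but the general pattern (score changed code for readability and gate low scores behind a pair review) is easy to sketch as a CI step. Everything below is a hypothetical stand-in: the threshold is arbitrary, and the maintainability index from the open-source radon package is just one possible scoring backend, not what Block actually uses.

```python
import os
import subprocess
import sys

from radon.metrics import mi_visit  # maintainability index from the radon package

# Illustrative threshold only; pick a cutoff that matches your own codebase.
MI_THRESHOLD = 65.0

def changed_python_files(base: str = "origin/main") -> list[str]:
    """List Python files changed relative to the base branch."""
    out = subprocess.run(
        ["git", "diff", "--name-only", base, "--", "*.py"],
        capture_output=True, text=True, check=True,
    )
    return [line for line in out.stdout.splitlines() if line]

def main() -> int:
    flagged = []
    for path in changed_python_files():
        if not os.path.exists(path):  # skip files deleted in this change
            continue
        with open(path, encoding="utf-8") as f:
            score = mi_visit(f.read(), multi=True)  # higher = easier to maintain
        if score < MI_THRESHOLD:
            flagged.append((path, score))
    for path, score in flagged:
        print(f"{path}: maintainability {score:.1f} < {MI_THRESHOLD}, pair review required")
    return 1 if flagged else 0  # non-zero exit fails the CI job, enforcing the gate

if __name__ == "__main__":
    sys.exit(main())
```

In practice you would run a check like this only on files touched by AI-assisted changes, which is where the commit tagging discussed later in this piece comes in.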

Quality isn’t about zero bugs. It’s about sustainable code. Code that doesn’t become a liability six months later.


The Hidden Cost: Developer Experience

AI doesn’t just affect code. It affects people.

At Booking.com, 78% of engineers said they liked using AI for routine tasks. But 63% worried the codebase was becoming harder to maintain. One engineer wrote: “I spent 40 minutes fixing an AI-generated function that looked perfect but didn’t handle time zones correctly. I didn’t write it. I didn’t even know it existed until the bug report came in.”

This is the paradox: AI makes individuals feel more productive, but teams feel less in control. Developers report higher stress levels when they can’t trust the code around them. Turnover in teams using AI without guardrails is 22% higher than in teams that don’t use it, according to a 2025 Stack Overflow survey.

The best teams treat AI like a junior developer, someone whose work needs supervision. They:

  • Require at least one human review for all AI-generated code
  • Use automated tests to catch AI-specific failures (e.g., missing null checks, hardcoded keys); a minimal check of this kind is sketched after this list
  • Hold monthly “AI retrospectives” where teams share what worked and what backfired
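
As one concrete example of such an automated test, the sketch below scans files for hardcoded secrets. The patterns are deliberately minimal and illustrative; a real setup would lean on an established scanner such as gitleaks or detect-secrets, plus unit tests for the missing-null-check cases.

```python
import re
import sys
from pathlib import Path

# Illustrative patterns for one common AI-generated failure: hardcoded secrets.
SECRET_PATTERNS = [
    re.compile(r"""(api[_-]?key|secret|token|password)\s*=\s*["'][^"']{8,}["']""", re.I),
    re.compile(r"AKIA[0-9A-Z]{16}"),  # the shape of an AWS access key ID
]

def scan(paths: list[str]) -> list[str]:
    """Return one finding per line that looks like it embeds a credential."""
    findings = []
    for name in paths:
        path = Path(name)
        if path.suffix != ".py" or not path.exists():
            continue
        for lineno, line in enumerate(path.read_text(encoding="utf-8").splitlines(), 1):
            if any(p.search(line) for p in SECRET_PATTERNS):
                findings.append(f"{name}:{lineno}: possible hardcoded secret")
    return findings

if __name__ == "__main__":
    problems = scan(sys.argv[1:])   # e.g. invoked by a pre-commit hook on changed files
    print("\n".join(problems))
    sys.exit(1 if problems else 0)  # non-zero exit blocks the commit or merge
```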

How to Measure This Right

Forget vendor dashboards. You need your own system. Here’s what works:

  1. Start with two teams doing similar work. Give one team the AI tools and let the other work as usual. Run this for at least two full release cycles.
  2. Track four core metrics:
    • Features delivered per week (not PRs merged)
    • Production incidents caused by new code
    • Time from feature request to customer use
    • Developer satisfaction score (survey every 4 weeks)
  3. Segment your data. Use tools like GetDX or Swarmia to tag commits as AI-assisted or human-only, then compare performance between the two groups (a minimal comparison is sketched after this list).
  4. Don’t trust self-reports. The METR study showed developers consistently overestimated their speed gains. Ask them how long a task took after they finished, not while they were doing it.
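
A rough sketch of step 3, assuming you can export (or build) per-change records carrying an AI-assisted flag. The records, field names, and numbers below are invented for illustration; they are not output from GetDX or Swarmia.

```python
from statistics import mean

# Hypothetical per-change records with an AI-assisted tag, e.g. assembled from
# your commit metadata. Field names and values are made up for this sketch.
changes = [
    {"ai_assisted": True,  "review_hours": 5.0, "caused_incident": False},
    {"ai_assisted": True,  "review_hours": 9.5, "caused_incident": True},
    {"ai_assisted": False, "review_hours": 3.0, "caused_incident": False},
    {"ai_assisted": False, "review_hours": 4.5, "caused_incident": False},
]

for label, flag in (("AI-assisted", True), ("human-only", False)):
    group = [c for c in changes if c["ai_assisted"] == flag]
    print(
        f"{label}: {len(group)} changes, "
        f"avg review {mean(c['review_hours'] for c in group):.1f}h, "
        f"incident rate {sum(c['caused_incident'] for c in group) / len(group):.0%}"
    )
```

With real data, the same comparison extends to lead time, features delivered, and the other core metrics.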

What the Best Companies Do Differently

The companies winning with AI aren’t the ones with the fanciest tools. They’re the ones with the clearest rules.

- Booking.com: Used AI to cut boilerplate, but added automated testing and code ownership rules. Result: 16% faster delivery, 12% fewer incidents.

- Block: Built internal tools to score AI code for maintainability. Result: 30% less time spent on legacy fixes.

- GitLab: Abandoned acceptance rate entirely. Now tracks “rework rate”: how often AI-generated code gets rewritten (a rough way to approximate it is sketched below). Result: 40% drop in rework after setting a 15% cap.

- Amazon: Required all AI-generated code to pass security scans before merge. Result: Zero critical vulnerabilities from AI output in 2025.

They all measure the same thing: business impact, not developer activity.
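
GitLab hasn’t published its exact formula, but a crude rework-rate proxy can be computed from version control alone: the share of AI-assisted commits whose files get touched again shortly afterward. The sketch below assumes a hypothetical convention where AI-assisted commits carry an “AI-Assisted: yes” line in their message, and it measures rework per file rather than per line, so treat it as an approximation.

```python
import subprocess
from datetime import datetime, timedelta

REWORK_WINDOW = timedelta(days=14)  # illustrative window, not GitLab's definition

def git(*args: str) -> str:
    return subprocess.run(
        ["git", *args], capture_output=True, text=True, check=True
    ).stdout

def files_in(sha: str) -> set[str]:
    """Files touched by one commit."""
    out = git("show", "--name-only", "--pretty=format:", sha)
    return {line for line in out.splitlines() if line}

# "<sha> <committer date>" for each AI-tagged commit in the last 90 days.
tagged = git(
    "log", "--since=90 days ago", "--grep=AI-Assisted: yes", "--pretty=format:%H %cI"
).splitlines()

reworked = 0
for row in tagged:
    sha, date = row.split(" ", 1)
    deadline = datetime.fromisoformat(date) + REWORK_WINDOW
    files = files_in(sha)
    if not files:
        continue
    # Any later commit that touched the same files before the deadline counts.
    later = git(
        "log", f"{sha}..HEAD", f"--until={deadline.isoformat()}",
        "--pretty=format:%H", "--", *files,
    )
    if later.strip():
        reworked += 1

if tagged:
    print(f"Rework rate: {reworked}/{len(tagged)} = {reworked / len(tagged):.0%}")
```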

Where This Is Headed

By late 2026, 85% of large enterprises will use “tension metrics” to balance speed and quality. The SEC is already requiring financial firms to prove AI-generated code meets the same audit standards as human-written code. That’s not a trend; it’s a mandate.

The next wave won’t be about writing code faster. It’ll be about writing code that lasts. About teams that trust their systems. About engineers who aren’t drowning in AI-generated noise.

AI isn’t replacing developers. It’s revealing which teams are actually good at engineering, and which ones were just good at typing.

Does using AI coding assistants actually make developers faster?

It depends. For simple, repetitive tasks like writing boilerplate code or generating test stubs, yes: developers can save 20-30% of their time. But for complex problems, especially in mature codebases, AI often slows things down. A 2025 study found experienced developers took 19% longer to complete realistic coding tasks when using AI, because they spent more time fixing bugs and clarifying unclear code. Speed gains are real, but they’re often canceled out by quality issues.

Why is acceptance rate a bad metric for measuring AI productivity?

Acceptance rate measures how often developers click “accept” on AI suggestions, but not whether those suggestions were useful. Many developers accept suggestions they later rewrite entirely. One team had a 38% acceptance rate but saw no improvement in delivery speed because every AI-generated line needed heavy edits. Acceptance rate rewards quantity, not quality. It’s like measuring a writer’s skill by how many words they copy from a dictionary.

What metrics should I track instead of lines of code or acceptance rate?

Track outcomes, not activity. Focus on: features delivered per week (that customers actually use), time from feature request to live deployment, production incidents caused by new code, and developer satisfaction. These show whether AI is helping or hurting your real business goals. Tools like GetDX’s DX Core 4 or AWS’s CTS-SW framework provide structured ways to measure these.

Can AI coding assistants hurt code quality?

Yes, and often they do. AI generates code that looks correct but misses edge cases, handles errors poorly, relies on undocumented assumptions, or introduces security flaws. Teams that don’t enforce code reviews, automated testing, or maintainability scores end up with codebases that are harder to understand and more expensive to fix. Block found that AI-generated code scored lower on maintainability metrics, so they added mandatory pair reviews for all AI output and cut long-term fixes by 30%.

How do I know if my team is using AI effectively?

If your team is shipping features faster, with fewer bugs, and developers feel less burned out, you’re doing it right. If you’re seeing more PR rework, longer review times, or complaints about “AI noise” in the codebase, you’re not. The best teams use AI for routine tasks, enforce strict review policies, and measure business impact, not coding speed. Start with a controlled experiment: compare two similar teams, one with AI and one without, over two releases. Then measure what matters: delivery speed, quality, and team happiness.

3 Comments

  • Raji viji

    December 15, 2025 AT 23:49

    Let’s be real - AI coding assistants are just glorified autocomplete with a side of delusion. I’ve seen devs accept 80% of suggestions, then spend 3x the time debugging nonsense like ‘const user = await fetch(‘/api/user’);’ without a single try-catch. It’s not productivity - it’s technical debt on turbo mode. Companies are measuring keystrokes like they’re still in 2012. Wake up. The code isn’t ‘done’ just because it compiles.

  • Rajashree Iyer

    December 17, 2025 AT 15:20

    It’s not about speed. It’s about soul. Every line of AI-generated code is a whisper of someone else’s logic creeping into your team’s DNA. You think you’re saving time? You’re just outsourcing your responsibility to a machine that doesn’t care if the app crashes at 3 AM. I’ve seen engineers cry over bugs they didn’t write - because they didn’t *feel* them. That’s the real cost. The silence after the AI clicks ‘accept’… that’s the sound of engineering losing its heartbeat.

  • Parth Haz

    December 19, 2025 AT 08:24

    While the concerns raised are valid, it’s important to recognize that AI tools are still evolving. The key is not to reject them outright but to implement structured governance. Teams that combine AI assistance with mandatory code reviews, automated quality gates, and regular feedback loops have seen measurable improvements in both velocity and reliability. The goal should be augmentation, not replacement - and that requires thoughtful process design, not just tool adoption.
