~3,100 words; about a 17-minute read
TL;DR
- GenAI is sold to developers with more hype than anything I’ve seen
- When studies claim GenAI leads to large productivity gains, follow the money, dig into the details, and be skeptical
- Learn to ask better, deeper questions than the typical study
Catalyst
A video presentation by Dave Farley at Modern Software Engineering recently made the rounds in my software development circles. As of this writing, the video has over 200k views and over 10k likes.
The video highlights a new study that claims AI assistance improved developer productivity by 30-55% while writing code that was just as maintainable as non-AI assisted code.
Since Farley’s name is on the study and, from what I can tell, he’s its public face, I’m going to call the study he refers to the “Farley study.”
My Background
I’ve been paid to write software for more than thirty years. I’ve seen and done a lot. I’ve architected enterprise solutions that have served many millions of people. I’ve performed complex integrations. I’ve consulted. I’ve started up. (I feel like Indiana Jones: “it’s not the years, it’s the mileage.”)
I’m known for loving high-quality code, and I’m an advocate and practitioner of test-driven development (TDD). I care deeply about the craft of software engineering, and especially about deploying technology for the good of people and businesses.
I lived through the dot-com bubble and burst — a period of unprecedented, worldwide hype.
A (hopeful) hallmark of my career is to help people really understand what technology can and can’t do for them. While I’ve loved technology and been blessed to have a career in it, I also see it misapplied and misused. I’ve especially seen Big Tech harm the world with their intentionally unethical practices. But that’s another story.
I share this background to help readers, especially younger developers, know that I’ve seen a lot.
And I’ve never seen hype like that which engulfs today’s generative “artificial intelligence” (GenAI).
Does that mean we should reject GenAI just because it’s so puffed-up with hype? Not at all. But we also shouldn’t just ride along on popular storylines and accept it.
Engaging The Study
So let’s look at Farley’s study. His name is last on the author list. He humbly claims in the video that the other authors did most of the work. But the 151 participants were mostly drawn from Farley’s audience, a group he admits is mostly aligned with his teachings. And later in the video, Farley shares that he teaches courses that help developers use GenAI.
So while that doesn’t invalidate the study on its face, he does stand to gain by promoting GenAI for development. That does, and should, invite skepticism.
What about the other authors?
The first author, Markus Borg, works for CodeScene, whose headline is, “Scale AI coding safely without sacrificing quality.” Obviously they have a dog in this fight. In fact, their headline almost summarizes the outcome of the study they’re promoting.
Three other authors work for Equal Experts, a consulting company that offers AI development and AI-accelerated delivery services. Another three are from Lund University’s Department of Computer Science in Sweden. Lund is hugely invested in AI.
For a pro-AI paper that claims a 30-55% reduction in task completion time, with no increase in maintenance time and no loss of quality, the fact that all the authors profit from services that support AI for development is a reason to at least raise an eyebrow.
By comparison, consider a second study, one by Model Evaluation and Threat Research (METR) from July 2025. That study found that AI assistance made experienced developers 19% slower.
METR says they are “a research nonprofit which evaluates frontier AI models to help companies and wider society understand AI capabilities and what risks they pose.”
METR has not accepted funding from AI companies, though we make use of significant free compute credits, as noted above. Independent funding has been crucial for our ability to pursue the most promising research directions and set standards for evidence-based understanding of risks from AI. It is also part of how we ensure that our research is as accurate as possible.
So METR doesn’t profit from selling AI services. They seem to be funded by people honestly curious about how AI is really working out.
Also, compare the introductions of these two papers.
Farley’s study starts with:
Generative AI is rapidly transforming software development, disrupting the discipline as we know it. Tools based on Large Language Models (LLMs), such as GitHub Copilot and ChatGPT, have seen widespread adoption among developers […] The appeal of AI assistants for code synthesis is clear and, as we will review in Section 2.3, several empirical studies, in fact, suggest that working with them can lead to significant productivity gains.
Rapid transformation, disruption, widespread adoption, clear appeal, significant gains. One could be excused for calling this “hype.”
The METR study more modestly introduces their findings with:
Software development is an important part of the modern economy, and a key domain for understanding and forecasting AI capabilities. Frontier AI systems demonstrate impressive capabilities on a wide range of software benchmarks and in experiments measuring AI’s impact on developer productivity when completing synthetic tasks. However, tasks used in these lab experiments sacrifice realism for scale and efficiency: the tasks are typically self-contained, do not require much prior context/familiarity to understand and complete, and use algorithmic evaluation metrics which do not capture many important capabilities. As a result, it can be difficult to draw inferences from results on these evaluations about AI’s impact in practice.
I think Farley’s study sacrifices “realism for scale and efficiency,” making it “difficult to draw inferences from [their] results.” While Farley’s team did try to make tasks that felt like typical customer requests, they ultimately landed on one self-contained, narrowly bounded, two-step problem to solve. In contrast, the METR study measured real-world work performed by experienced open source developers on codebases they already knew well.
And lest you think that the METR study is biased against AI for development, consider:
… these results do not imply that current AI systems are not useful in many realistic, economically relevant settings. Furthermore, these results do not imply that future models will not speed up developers in this exact setting—this is a salient possibility given the rapid pace of progress in AI capabilities recently. Finally, it remains possible that further improvements to current AI systems … could yield positive speedup in this setting. (p. 3)
METR isn’t discounting the future possibility that AI may improve development productivity. They’re just pushing back against studies like Farley’s by measuring developers working on real world tasks vs. synthetic, scripted tasks.
Questionable Comparisons
The Farley study uses Bayesian analysis to calculate its 30-55% “completion time” improvement. Bayesian calculations require choosing a “prior”: a starting assumption about the likely result, which the new data then update.
I’m not a mathematician. I don’t even play one on YouTube.
But the study’s authors chose to pull a 55% speed improvement from Microsoft’s famous 2023 GitHub Copilot study and use it as their “optimistic prior,” that is, what might optimistically be expected to show up in their measurements.
(Does anyone wonder whether Microsoft may have a reason to tout a study like this, given the unfathomable billions they’ve invested in OpenAI and their own infrastructure? Almost like they have an existential reason for this claim to be true?)
What did the Microsoft study measure? The METR study says it well:
For example, Peng et al. [the Microsoft GitHub Copilot study] asks developers to implement a very basic HTTP server in JavaScript to satisfy several automatic test cases that are shown to the developers—this task is a) unrepresentative of most software development work, and b) likely to be similar to a large amount of LLM training data, which may unfairly advantage AI systems relative to humans. (p. 3)
Another way you could describe what Microsoft tested is: “how much faster could a developer create a JavaScript HTTP server if they copied most of the code from another JavaScript HTTP server?” If the number was “55% faster,” we wouldn’t be surprised.
So if Farley’s study uses Microsoft’s claim in their calculation for speed, that, to me, calls their results even more into question.
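To make that concern concrete, here’s a toy sketch in Python. It is not the study’s actual model; it’s just a textbook conjugate Normal-Normal update with made-up numbers. But it shows how much the choice of prior can swing the answer when the observed data are noisy:

```python
# Toy illustration only: a conjugate Normal-Normal update for a "speedup" effect.
# All numbers are hypothetical; this is not the Farley study's actual model.

def posterior_mean(prior_mean, prior_var, data_mean, data_var, n):
    """Posterior mean of a Normal mean with known variance (standard conjugate update)."""
    prior_precision = 1.0 / prior_var
    data_precision = n / data_var
    return (prior_precision * prior_mean + data_precision * data_mean) / (
        prior_precision + data_precision
    )

observed_speedup = 0.05   # suppose the noisy experiment itself hints at only ~5%
noise_var = 0.20          # large between-participant variability
participants = 20

optimistic = posterior_mean(0.55, 0.05, observed_speedup, noise_var, participants)
skeptical = posterior_mean(0.00, 0.05, observed_speedup, noise_var, participants)

print(f"posterior with 55% 'optimistic' prior: {optimistic:.0%}")  # ~13%
print(f"posterior with 0% 'skeptical' prior:  {skeptical:.0%}")    # ~4%
```

Same data, very different estimates, purely because of the prior. That’s why the pedigree of that 55% figure matters.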
Interestingly, an article Farley cites in his video also includes very different completion-time stats from those in Farley’s study:
The real range, when it’s measured in terms of team outcomes like delivery lead time and release stability, is roughly 0.8x – 1.2x, with negative effects being substantially more common than positives.
Obviously, +/- 20% is a very different outcome than 30-55% faster.
To try and refine the efficiency question, Farley’s team “complement[ed] the direct measurements with participants’ self-reported perceptions of productivity” (p. 55). Again, they found no significant difference. But is this data worth listening to?
In the METR study, developers felt 20% faster, but they were actually 19% slower. So how a developer feels about their productivity may not be a great measure, certainly not enough to back up a claim designed to encourage developers to use GenAI.
Equally Easy To Maintain And Of Equal Quality?
The main point of Farley’s study is to claim that AI-assisted code is no harder to maintain and no lower in quality than code written without AI assistance. The study tried hard to set up test conditions that would let the authors make this claim with a credible level of confidence.
Because they wanted to debunk what Farley calls “fear mongering” on this point.
But did they? To me, the task that supposedly shows whether the code was equally easy to maintain was so relatively trivial that the results could be interpreted to promote any narrative the authors wanted.
The measured task was to extend a recipe app’s search feature, built in a first task, with a trivial option to filter by cost per serving. They measured maintainability by evaluating how quickly developers could change the code that had been written in the first task either with or without AI assistance, and they measured how productive developers felt afterwards.
They say, “We found substantial variability in Task 2 completion time, with no significant differences between the treatment [modifying AI code] and control [modifying non-AI code] groups” (p. 54). The big difference in completion times (a range of about 2-13 hours, p. 32) was due to the time each developer needed to set up their environment, understand the project, and do the work according to their varied “definition of done.”
This sentence seems telling: “We consider [calculation issues] another indication that the available data was insufficient to draw firm conclusions related to completion time” (p. 54). They then cite studies like the Microsoft GitHub Copilot one to help firm up their “insufficient” data.
And they end up claiming that there was no measurable difference in completion time between working on AI-assisted and non-AI-assisted code.
Maybe I’m missing something, but it looks to me like the absence of a clear signal became the signal itself. They seem to use their inconclusive data to show that AI assistance doesn’t harm code quality, a claim of which Farley says, “frankly, given some of the fear-mongering, that’s a pretty significant result. And a finding, that as far as I understand it, is new to this research.”
So the video claims maintenance time isn’t harmed, but the details in the study show little reason for making such a large claim, especially in the light of other studies to the contrary (next heading).
In a helpful section about the “limits and threats to validity” (p. 59), Farley’s co-authors rightly share many issues that could make the data less valid. On the completion time front, “many participants did not complete the task in one uninterrupted session” leading them to give a “best time estimate.” In other words, many completion times were not accurate at all. But completion time is a foundation of their headline claim.
On the quality front, they cite the risk that there was a “substantial variation in participants’ implicit definitions of ‘good enough’ code quality for submission.” Some developers went above and beyond, and others did the bare minimum. Quite subjective, seemingly pushing against their claim that their “study has resulted in several novel insights backed by empirical data” (p. 57).
A Competing Study
A recent study by CodeRabbit (which sells a service to improve AI code reviews) found that “AI code creates 1.7x more problems.” They say:
We analyzed 470 open-source GitHub pull requests, including 320 AI-co-authored PRs and 150 human-only PRs, using CodeRabbit’s structured issue taxonomy. Every finding was normalized to issues per 100 PRs and we used statistical rate ratios to compare how often different types of problems appeared in each group.
The results? Clear, measurable, and consistent with what many developers have been feeling intuitively: AI accelerates output, but it also amplifies certain categories of mistakes.
The top 10 issues in this study are especially concerning: logic/correctness problems were 75% more common, error-handling problems happened twice as often, and security vulnerabilities were 2.74x higher. Many other issues were cited across the 470 PRs they evaluated.
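If the “issues per 100 PRs” normalization and “rate ratio” language is unfamiliar, here’s a minimal sketch of that arithmetic. The counts below are hypothetical, chosen only to land near CodeRabbit’s 1.7x headline; they are not the study’s data:

```python
# Minimal sketch of the "issues per 100 PRs" normalization and rate-ratio comparison
# CodeRabbit describes. The counts here are hypothetical, not their data.

def per_100_prs(issue_count: int, pr_count: int) -> float:
    """Normalize an issue count to issues per 100 pull requests."""
    return 100.0 * issue_count / pr_count

def rate_ratio(ai_issues: int, ai_prs: int, human_issues: int, human_prs: int) -> float:
    """How many times more often an issue type shows up in AI-co-authored PRs."""
    return per_100_prs(ai_issues, ai_prs) / per_100_prs(human_issues, human_prs)

# Hypothetical counts for a single issue category:
print(rate_ratio(ai_issues=96, ai_prs=320, human_issues=26, human_prs=150))  # ~1.73
```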
To be consistent with my earlier warnings, please note that the problems cited by CodeRabbit are exactly what the service they’re selling tries to mitigate. So they’re not saying to avoid using AI for development, but that if you use AI, you’re going to have a lot of problems unless you use their tool to defend against them.
So who do we believe? What should we trust, when so much is at stake, and there are so many interests backing the various claims?
Asking Better Questions
I’ve gone into quite a bit of detail here because studies like Farley’s are used to promote what I see as an unjustified claim of productivity. I think we should question the study itself, the methods of assessing the data, and the motives of the authors.
But what if I’m wrong, and they’re right? Is efficiency still a good justification?
Maybe. But to me, efficiency is just one consideration among many. We do want to be efficient, but to what end? If we turn a dial of efficiency that ends up degrading innovation, creativity, or long-term skill retention, have we made a good trade?
Consider: How long does it take to build a feature that your business really needs to thrive and overcome competition? How do you design the right architecture for your business to thrive? What fosters true innovation and creativity on a dev team?
How can your developers really understand their codebase so they can make wise and insightful design decisions in the face of real world tradeoffs?
And most concerning to me: how are developers changed by their embrace of GenAI? The hype-filled stories of inevitable progress may be leading towards a harmful destination. And Big Tech has proven not to care at all about whether people are harmed by their products with their “move fast and break things” ethic.
Studies like Farley’s don’t ask questions like these. I think we should.
Beware “Digital Deskilling”
Cal Newport recently encouraged developers to beware of “digital deskilling.” After critiquing the similarly tainted, self-serving, hype-filled claims of Anthropic’s head of Claude Code, he asks us to “be wary of any such demonstration.” Newport continues:
A world in which software development is reduced to the ersatz management of energetic but messy digital agents is a world in which a once important economic sector is stripped down to fewer, more poorly paid jobs, as wrangling agents requires much less skill than producing elegant code from scratch. The consumer would fare no better, as the resulting software would be less stable and innovation would slow.
The only group that would unambiguously benefit from deskilling developers would be the technology companies themselves, which could minimize one of their biggest expenses: their employees.
Nobody is helped by a path where developers become dependent on AI and lose their proficiency in the discipline of software engineering itself. Society at large, our companies, and we developers all lose big time.
But that concern is pooh-poohed as “fearmongering” by people like Farley who sell products/services to participate in the GenAI gold rush.
You might say, “Not me; I’m careful. I’m watching myself so I just use AI as a tool for good and am not going to lose my skills. I’m not going to become dependent.”
I hope that’s true. But in a powerful new article, “How AI Destroys Institutions,” law professors Hartzog and Silbey share warnings that everyone needs to consider. It’s not just individual skills that are at risk in our AI adoption; it’s also the institutional uniqueness driven by human relationships.
Think about it. The secret sauce of any development team is their ability to work together to solve complex problems in more creative and helpful ways than their competitors. Developers working well together are a “force multiplier.” Not GenAI.
But again, you might say, “not me.” Hartzog & Silbey would warn:
Perhaps if human nature were a little less vulnerable to the siren’s call of shortcuts, then AI could achieve the potential its creators envisioned for it. But that is not the world we live in. Short-term political and financial incentives amplify the worst aspects of AI systems, including domination of human will, abrogation of accountability, delegation of responsibility, and obfuscation of knowledge and control.
People are only human. It is unreasonable to expect the kind of superhuman willpower necessary for all of us at scale to indefinitely avoid the worst temptations of AI. Even if it were feasible to ensure accountability for the design and function of these systems, AI is not the fix for institutions that efficiency enthusiasts have been looking for. It is a poison pill that will extract a substantial cost upon institutions, even in its most optimal deployments.
If we’re justifying our use of GenAI with shaky claims of productivity while ignoring the risks, how are we not falling prey to our “human nature,” our desire to fit in and go along with the crowd, our fear of missing out, our magnetic draw to the new and shiny?
Because even software developers don’t have “superhuman willpower,” as objective and disciplined as we all try to be.
Conclusion
GenAI is not an inevitable success. Nobody knows how it will turn out. But hype-filled studies backed by authors who stand to gain by our embrace of their headline claims aren’t likely to lead us to a happy place.
Wherever you land after reading this, please try to evaluate the stories you’re living in. Anyone can be influenced by the myth of progress, inevitability, FOMO, or the tool trope today.
Most importantly, reach for deeper questions than those typically featured in today’s GenAI headlines.
Photo by ThisIsEngineering
