How can management decisions lead to the derailment of a train and the deaths of 107 people? In the case of the Amagasaki derailment—and countless other incidents—it came down to the flawed use of metrics.
Poorly chosen metrics can cause serious damage even when lives aren’t at stake. In software development, we measure productivity by proxies like lines of code, number of pull requests, and code coverage. More subtle activities, such as mentoring teammates, eliminating unnecessary work, and solving issues before they become apparent, are far harder to quantify. Since these impactful contributions don’t move the needle on standard measurements, they can go unnoticed until it is too late and the organization is already failing irreversibly.
As a trained scientist, I know firsthand how essential it is to test assumptions by analyzing real data. Early in my career, I leaned heavily on metrics, believing they would provide support for my management decisions. I learned the hard way how easily that can backfire. Placing too much emphasis on the wrong software metrics puts us at risk of measuring what is easy to measure, rather than what is most important.
Metrics are a double-edged sword. Ignore them, and you risk making decisions based on gut feeling alone. Rely on them blindly, and you create perverse incentives that can undermine your actual goals.
So, I set out on a quest to understand how we use metrics in science and business: what makes them valuable, how to wield them effectively, and—most importantly—how to avoid turning them into traps.
In one of my earliest experiences as a team lead, a client of mine wanted a social network for tourists, and I partnered with two other engineers to build it. After a while, though, I got frustrated. I was working overtime to meet the project's deadline, yet my partners didn’t seem to be pulling their weight! Given my background in science, I decided to find a way to test my assumption. The first idea that popped into my head was measuring the lines of code (LOC) written by each of us. The result was staggering: I had written about 80% of the code, while my two colleagues had shared the remaining 20% between themselves. What did I do? Of course, I confronted them about it, with predictably horrible consequences. One of us would end up gaming the metric within a month.
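If you are wondering what that measurement looks like in practice, here is a minimal sketch of the general idea (not the exact script I used back then), assuming the project lives in a single git repository; the author names are placeholders:

```python
import subprocess
from collections import defaultdict

# Placeholder author names; replace with the committers in your repository.
AUTHORS = ["Alice", "Bob", "Carol"]

def lines_added_by(author: str, repo_path: str = ".") -> int:
    """Sum the 'added' column of `git log --numstat` for one author."""
    out = subprocess.run(
        ["git", "-C", repo_path, "log", f"--author={author}",
         "--pretty=tformat:", "--numstat"],
        capture_output=True, text=True, check=True,
    ).stdout
    total = 0
    for line in out.splitlines():
        parts = line.split("\t")
        # Each numstat line is "<added>\t<deleted>\t<path>".
        # Binary files show "-" instead of a number; skip those.
        if len(parts) == 3 and parts[0].isdigit():
            total += int(parts[0])
    return total

if __name__ == "__main__":
    counts = {author: lines_added_by(author) for author in AUTHORS}
    grand_total = sum(counts.values()) or 1
    for author, added in sorted(counts.items(), key=lambda kv: -kv[1]):
        print(f"{author:>10}: {added:>8} lines added "
              f"({100 * added / grand_total:.0f}%)")
```

Note how little effort this takes: a few lines of scripting produce a precise-looking percentage, which is exactly what makes LOC so tempting and, as we will see, so easy to game.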
I would not be the first person to feel that my coworkers were slacking off, based on their quantitative output. In the late 1800s, a factory foreman named Frederick Taylor observed that his teams took three times longer to finish a task than he believed they should. Taylor went on to develop his famed theory of "Scientific Management" (aka "Taylorism"). His goal was to cut manufacturing costs by increasing worker productivity and efficiency. He used a stopwatch to track how long each part of a task took, designed shovels that made the most efficient use of a laborer's effort, and so on.
Taylor's method of directly intervening in work processes at the individual level sounds like a micromanaging nightmare. But Taylorism proved successful enough that it evolved and spread across industries over the next hundred-plus years. As knowledge work became more common in the 1950s and 60s, managers ran up against the same problems Taylor had faced almost a hundred years prior. Workers seemed inefficient, and it was difficult to project when (or if!) a task might finish. And when tasks did finish, the results were of mixed quality.
Management consultants who had cut their teeth on methods rooted in Taylorism were eager to apply them to this new breed of workers. Software developers got no special treatment, and LOC was an easy target. A report from the 1990s, titled "Software Size Measurement: A Framework for Counting Source Statements", avers that "Size measures have direct application to the planning, tracking, and estimating of software projects. They are used also to compute productivities ...".
Fast forward to today, and we find a plethora of developer productivity and code quality measurements: everything from test coverage, defect density, and cyclomatic complexity to more comprehensive systems like DORA's key metrics.
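To give a sense of how mechanical these measurements are, here is a minimal sketch that computes two of the DORA metrics, deployment frequency and lead time for changes, from a handful of hypothetical deployment records; the data structure and the sample values are assumptions made purely for illustration:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median

# Hypothetical deployment log; in practice this would come from your
# CI/CD system. Each record pairs a deployment with the commit it shipped.
@dataclass
class Deployment:
    committed_at: datetime
    deployed_at: datetime

deployments = [
    Deployment(datetime(2024, 3, 1, 10), datetime(2024, 3, 1, 15)),
    Deployment(datetime(2024, 3, 3, 9),  datetime(2024, 3, 4, 11)),
    Deployment(datetime(2024, 3, 7, 14), datetime(2024, 3, 7, 16)),
]

# Deployment frequency: deployments per week over the observed window.
window = deployments[-1].deployed_at - deployments[0].deployed_at
per_week = len(deployments) / max(window / timedelta(weeks=1), 1e-9)

# Lead time for changes: median time from commit to deployment.
lead_time = median(d.deployed_at - d.committed_at for d in deployments)

print(f"Deployment frequency: {per_week:.1f} per week")
print(f"Median lead time for changes: {lead_time}")
```

Real tooling mostly differs in where the data comes from (CI/CD pipelines, incident trackers), not in the arithmetic itself.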
Is it any wonder that in my early days, I had focused on something quick and dirty like LOC as a proxy for developer productivity?
However, Taylorism has more than its fair share of critics. The debate around when, how, and what kind of metrics to use regularly flares up in the wake of incidents that involve the loss of lives.
On 25 April 2005, at 9:15 am, a train passed Tsukaguchi Station four minutes late. The driver, trying to make up for the delay, ran a red light, exceeded speed limits, and, crucially, failed to slow down before a curve. Your gut reaction might be that the driver was reckless, and you wouldn’t be wrong. However, there is more to the story.
JR West, the company operating the train, had strict rules around delays as short as 60 seconds, and drivers who failed to meet these standards faced harsh punishments. The driver in our story had been subjected to these punishments before and was likely trying to avoid them, which led to the derailment of the train and the deaths of 107 people, including his own. While this does not absolve him of responsibility, it raises questions about JR West's policies. Had a punctuality metric been prioritized above safety?
However, the consequences are not always this easy to notice. In an effort to increase transparency, heart surgery patients in the UK are given online access to their surgeons’ success rates. When cardiac surgeon Samer Nashef anonymously surveyed his peers, he found that one in three of them actively refused to perform risky procedures to keep their scores from getting worse. This has potentially led to people not receiving life-saving treatment because their doctors were afraid of them dying on the operating table and thereby dragging down their public performance data.
These incidents are representative of cases where metrics created perverse incentives. It seems as though when a person’s performance gets measured, they either knowingly or unintentionally game those metrics to the point where the metrics become meaningless or counterproductive. This is Goodhart’s Law in action.
Blaming employees for gaming metrics may seem like a straightforward solution, but it is both counterproductive and unfair. The very metrics meant to expose manipulation are themselves susceptible to Goodhart’s Law, making it difficult to distinguish genuine productivity from attempts to optimize for the numbers. More often than not, employees aren't deliberately gaming the system—they are simply responding to the incentives placed before them. Even when someone is aware that their actions are questionable, the perceived benefits of meeting a critical target can easily outweigh the risk of consequences, especially when those consequences are unclear or inconsistently enforced.
Some organizations attempt to counteract this by secretly tracking additional metrics to catch those who exploit the system, but this only fosters a culture of distrust, where employees feel watched and pressured rather than empowered to do meaningful work. Worse still, when flawed incentives affect an entire workforce, punishing individuals becomes both unfair and impractical.
The real solution is not relentless policing but ensuring that employees aren’t placed in situations where gaming metrics feels necessary or beneficial in the first place. But how can we achieve that?
In sharp contrast to Goodhart’s Law stands the management adage “you can’t manage what you don’t measure”. Yet once we understand the unintended consequences of metrics, we may be tempted to avoid them and rely on gut feeling instead. However, gut feelings are shaped by cognitive biases, and incorrect assumptions, conspiracy theories, and hunches routinely lead to poor decisions.
A striking example of this can be found in the UK’s National Health Service (NHS) reforms of the 1990s. Facing financial pressure, policymakers introduced a competitive market model, assuming that competition between healthcare providers would lead to lower costs and improved patient outcomes.
However, reality didn’t align with this assumption. A study released a decade later revealed that while waiting times had indeed shortened, death rates following emergency heart attack admissions had risen substantially [1], [2]. The increased competition had inadvertently shifted priorities: hospitals focused on meeting efficiency targets at the cost of human lives.
Here we have a case where a politically motivated, “gut” decision likely led to needless deaths. Tracking the metrics early and often, and making adjustments based on those results, might have actually saved lives.
So it seems that metrics are not always evil; sometimes they can save lives, too. Interestingly, the same metric, death rate, created the perverse incentive in the heart surgery example above, yet helped reveal the problem with the internal market. This raises a key question: is there such a thing as an inherently good or bad metric, or does it all depend on the context in which the metric is used and the scientific rigor with which it is applied? How do scientists use measurements, and what can we learn from that?
At its core, science embraces skepticism and strives to disprove, not prove, theories. This distinction is crucial. A good scientist approaches every experiment with the mindset, “How might I be wrong?” rather than “How can I confirm I’m right?” Every experiment begins with a hypothesis—a clear expectation rooted in logical arguments and minimal assumptions. The aim is to test whether those assumptions hold up under scrutiny.
Science thrives when its expectations are challenged, not when they are confirmed.
Controlling for variables and repeatability are also hallmarks of good science. These practices minimize bias and produce results that are reliable across contexts. Scientists take care that the act of measuring does not distort the value being measured. This is almost impossible when the subject of the measurement is aware it is being measured, i.e. when judging the performance of individuals or teams based on metrics.
This approach contrasts sharply with how metrics are often used in management. Instead of starting with hypotheses, metrics-driven systems typically begin with a target—choosing a number to optimize—and then monitor what moves the needle. This creates a feedback loop where the system adapts to the metric in ways that can be unintended and counterproductive. A less harmful situation is when individuals or teams are not directly rewarded or penalized based on a specific metric—meaning their compensation, promotions, or job security remain independent of it. However, when these factors are tied to the metric, the risk of unintended consequences escalates, often with disastrous results.
And that brings us back to the quote, “What gets measured, gets managed.” It is not only misattributed to Peter Drucker, it is also widely misunderstood. Drucker never said this. In his book “The Effective Executive” he actually argued that knowledge work cannot be measured the same way as manual work. His point was more that whatever is tracked for long enough becomes the only thing managed, while everything else risks being overlooked.
Metrics can be a powerful tool for grounding us in reality, but when used with the wrong mindset, they can just as easily serve to justify preexisting biases under the guise of scientific validation. Therefore the key isn’t abandoning metrics altogether, but understanding when and how to use them effectively.
When a metric is actively monitored and optimized, it risks becoming a target and a source of unintended consequences. If possible, avoid using metrics as direct levers for change. Instead, use them to validate assumptions and uncover hidden patterns.
Keep in mind that all metrics are subject to distortion. To mitigate this, consider how a measurement might influence behavior in unexpected ways, and control for those effects. Metrics also tend to be narrow in focus, so rely on multiple metrics to get a more nuanced picture.
Do not trust your metrics. If the results match your expectations, treat that as an invitation to dig deeper, not as a confirmation of your prior beliefs. The most meaningful insights often come from unexpected outcomes.
Metrics should also evolve over time. On one hand, what made sense six months ago may no longer be relevant, or may even be doing harm. On the other hand, the longer a metric is used, the more time it has to create the systemic issues that distort it. Regularly reassessing your metrics ensures they remain useful rather than counterproductive.
Perhaps the most dangerous mistake is tying incentives directly to specific metrics. When performance evaluations, bonuses, or promotions depend on a number, people will naturally find ways to improve that number, often at the cost of unintended and harmful side effects. Instead of forcing numbers to move, focus on creating an environment where meaningful work naturally drives the right outcomes.
Remember: metrics should inform rather than dictate decisions. They are useful only when combined with critical thinking, skepticism, and a willingness to adjust course when the data tells an unexpected story.
For a long time I believed that when I used LOC to validate my assumption that my colleagues were not pulling as much weight as I was, my error was the choice of metric. Today, however, I’d argue that I made several mistakes, and that choice was not one of them.
The first mistake was jumping to conclusions immediately, instead of getting suspicious when the metric seemed to confirm my assumptions. Having said that, both of my colleagues agreed that I had done more work, and both increased their efforts! Had I left it at that, this could have been a good use of a terrible metric.
Unfortunately, I committed another error in letting the metric outlive its usefulness. I used it to decide the share of income each of us received! It wasn’t long before I started generating lots of additional code. While it seemed like a good idea at the time, I’m still not sure if it wasn’t at least partly motivated by the desire to have a larger share of lines under my name.
PS: There’s an interesting backstory to why we started researching this topic that didn’t fit into this already lengthy article. If you want to hear about the surprising thing we discovered about DORA metrics—and how we conducted our research—be sure to subscribe to the Lean Poker newsletter. Next week, I’ll be publishing that story exclusively for subscribers.