Why Performance Reviews Fail | Sageo Blog

About This Series

In Blog 3, we looked at how to find and hire exceptional people. In this post, we look at what happens after they join. Most performance management systems are built on a false assumption about how human performance is distributed. That assumption shapes everything from how people are rated to how managers behave, and it quietly destroys the talent density you worked so hard to build.

The assumption nobody questions

Most performance review systems were built on a belief that felt intuitive at the time: that human performance follows a normal distribution. Plot your people on a bell curve, and most cluster in the middle. A few are exceptional. A few underperform. Reward accordingly.

This belief underpins almost every performance management system built in the last fifty years:

Forced ranking and stack ranking systems
The 20-70-10 performance distribution model
Compensation bands tied to position in the curve
Rating quotas that cap how many people can receive top scores

There is one problem: for knowledge workers, it is empirically wrong.

Ernest O'Boyle Jr. and Herman Aguinis published a landmark study in 2012 covering more than 633,000 individuals. In 94% of the groups they studied, performance followed a power law, not a normal distribution. A small number of exceptional contributors produced a disproportionate share of total output. In complex cognitive roles, the gap between the best and the average is not 20%. It is ten times or more.
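To make the difference between the two assumptions concrete, here is a minimal simulation sketch. It is not taken from the O'Boyle and Aguinis study: the mean of 100, the spread of 15, the Pareto shape of 1.5, and the helper name top_share are all illustrative assumptions. It compares how much of total output the top 10% of people produce in a bell-curve world and in a power-law world.

```python
import random

def top_share(outputs, top_fraction=0.10):
    """Share of total output produced by the top `top_fraction` of people."""
    ranked = sorted(outputs, reverse=True)
    k = max(1, int(len(ranked) * top_fraction))
    return sum(ranked[:k]) / sum(ranked)

rng = random.Random(42)  # fixed seed so the sketch is repeatable
n = 10_000

# Bell-curve world: output clusters around an assumed mean of 100.
normal_outputs = [max(0.0, rng.gauss(100, 15)) for _ in range(n)]

# Power-law world: Pareto-distributed output (shape 1.5 is an assumption).
pareto_outputs = [rng.paretovariate(1.5) for _ in range(n)]

normal_share = top_share(normal_outputs)
pareto_share = top_share(pareto_outputs)

print(f"Top 10% share of output, bell curve: {normal_share:.0%}")
print(f"Top 10% share of output, power law:  {pareto_share:.0%}")
```

Under the bell curve, the top tenth produces only slightly more than a tenth of the output. Under the power law, the same tenth accounts for a far larger share, which is the pattern a curve-based rating system cannot represent.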

When you build a performance management system on the wrong assumption, everything downstream is distorted. You rate exceptional people as good. You protect underperformers because the curve requires a middle. And you drive out your best people, who notice that the system was not designed to see what they actually do.

94% of studied populations follow a power law, not a bell curve (O'Boyle & Aguinis, 2012)
58% of executives say their PM system drives neither engagement nor high performance (Deloitte Global Human Capital Trends)
30% drop in voluntary attrition at Adobe after eliminating annual performance reviews in 2012 (Adobe Internal Research)

What forced ranking does to the people you most want to keep

The most vivid case study in the cost of the bell curve assumption is Microsoft between 2001 and 2012, the period journalist Kurt Eichenwald investigated in a landmark Vanity Fair article titled 'Microsoft's Lost Decade' (2012).

Under CEO Steve Ballmer, Microsoft used stack ranking: every team, regardless of overall quality, had to designate a fixed percentage as poor performers. The documented consequences:

Top performers deliberately avoided working with strong colleagues who might outrank them
Information was withheld from peers who were effectively competitors in the ranking
Significant energy went into managing the manager's perception rather than doing great work
Collaboration became a liability. Every colleague was a competitor

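One former employee told Eichenwald: “If you were on a team of 10 people, you walked in the first day knowing that, no matter how good everyone was, two people were going to get a great review, seven were going to get mediocre reviews, and one was going to get a terrible review. It leads to employees focusing on competing with each other rather than competing with other companies.”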

During this period, Microsoft missed the smartphone revolution, the search engine era, the social media wave, and the tablet market. The company that had been the most valuable in the world stagnated for a decade. Stack ranking was widely cited as a significant cultural contributor.

Microsoft vs. Netflix: two very different models

The contrast between Microsoft's historical approach and Netflix's Keeper Test is one of the most instructive comparisons in modern talent management. Both are serious, high-performance organisations. Both held their people to high standards. The difference is in the design of the system underneath.

Stack Ranking vs. The Keeper Test: A Comparison

Dimension	Microsoft Stack Ranking (Historical)	Netflix Keeper Test
Philosophy	Forced ranking / Zero-sum competition. A fixed percentage must always be rated poor.	Professional sports team model: absolute, not relative, standards. No quota on excellence.
Mechanics	Pre-determined percentage-based grades. A fixed proportion must be rated poor regardless of actual performance.	Continuous reflection: 'Would I fight hard to keep this person?' Evaluated against role requirements, not peers.
Effect on top talent	Forces top performers to avoid strong teammates to protect their relative rating. Collaboration becomes a liability.	Retains and groups elite talent. Encourages collaboration because individual success is not threatened by strong peers.
Effect on collaboration	Incentivises withholding help and, in some cases, peer sabotage. Every colleague becomes a competitor.	Encourages high collaboration alongside personal accountability. Strong peers are an asset, not a threat.
Primary failure mode	Destroys psychological safety. Contributed to a decade of Microsoft stagnation documented in Vanity Fair, 2012.	Can induce chronic anxiety if implemented without transparency, care, and continuous honest dialogue.

The critical distinction is between relative and absolute standards. Stack ranking evaluates people against each other. The Keeper Test evaluates people against what the role actually requires. In a stack ranking world, being exceptional on a team of exceptional people is dangerous. In a Keeper Test world, it is the whole point.
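The relative-versus-absolute distinction can be sketched in a few lines. This is a deliberately simplified illustration, not Microsoft's or Netflix's actual process: the Keeper Test is a manager's judgment rather than a numeric score, and the names, scores, 20% quota, and bar of 80 are all hypothetical.

```python
def forced_rank(scores, bottom_fraction=0.2):
    """Relative standard: the lowest-scoring fraction is always labelled 'poor'."""
    n_poor = max(1, round(len(scores) * bottom_fraction))
    lowest = set(sorted(scores, key=scores.get)[:n_poor])
    return {name: ("poor" if name in lowest else "ok") for name in scores}

def absolute_bar(scores, bar=80):
    """Absolute standard: each person is compared with the role's bar, not with peers."""
    return {name: ("keep" if score >= bar else "discuss")
            for name, score in scores.items()}

# A hypothetical team where everyone is strong.
team = {"Ana": 95, "Ben": 93, "Chi": 92, "Dev": 91, "Eli": 90}

forced = forced_rank(team)
absolute = absolute_bar(team)

print(forced)    # Eli is labelled 'poor' despite scoring 90
print(absolute)  # everyone clears the bar
```

With a quota, someone on an all-strong team must be rated poor. With an absolute bar, the same team can all pass, so a strong colleague costs nobody anything.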

What the Keeper Test actually is and how it works

The Keeper Test is deceptively simple. Netflix asks managers one question about each person on their team: if this person told me they were leaving for a similar role at a competitor, would I fight hard to keep them?

The answer drives two very different sets of actions:

Yes: I would fight hard to keep them

Invest actively

Expand scope and responsibility
Increase compensation to reflect market rate
Give stretch assignments and senior visibility

Outcome: Talent density maintained or raised. The high performer stays, recruits peers, and elevates team standards.

No: I would not fight to keep them

Have the honest conversation

Agree a transition timeline with dignity
Provide a generous severance package
Reopen the role with a higher bar

Outcome: Role reopened with a higher bar. Talent density protected from the slow drift of gradual dilution.

From Netflix's Culture Memo

“Adequate performance gets a generous severance package.” This line is designed to be provocative. What it means in practice is that Netflix is committed to ensuring every seat is held by someone exceptional, and that when an exit happens, it is handled with genuine care for the person leaving.

Making feedback safe to give and receive: the 4As model

The Keeper Test without a feedback culture is management by surprise, which is one of the most destructive things a leader can do to a team. Netflix's answer to this risk is the 4As model. It trains both sides of every feedback interaction so that continuous evaluation does not collapse into subjective judgment and fear.

The 4As Feedback Model: Roles, Principles, and What Each Prevents

Role: Giver

Aim to Assist

Feedback must benefit the recipient, not serve the giver's interests or ego. 'You missed the deadline' becomes 'Your late delivery pushed the launch back 3 days. Here is what I need from you.'

Prevents: Political feedback, passive-aggression, and performance documentation designed to justify a pre-made exit decision.

Role: Giver

Actionable

Identify specific, observable behaviours that can be changed. Focus on actions, not personality traits or character assessments.

Prevents: Vague criticism such as 'You have a bad attitude.' Non-specific feedback that offers the recipient nothing to act on.

Role: Receiver

Appreciate

Acknowledge the feedback. Listen without defensiveness, even when you disagree with the assessment.

Prevents: Defensive dismissal that shuts down future feedback and signals to the giver that candour is not safe here.

Role: Receiver

Accept or Discard

The receiver retains autonomy. They evaluate whether the feedback is valid and decide whether to integrate it. No forced compliance required.

Prevents: Compulsory agreement that creates a compliance culture and removes the receiver's agency in their own development.

Source: Netflix Employee Handbook (AirMason); Netflix Culture Memo

The 4As model matters because it separates the quality of the feedback from the relationship between the people involved. When both sides understand their role, feedback becomes an act of investment rather than an act of judgment.

When performance improvement is real

One of the most commonly misunderstood elements of high-density performance management is what happens when someone is not performing. In most organisations, a Performance Improvement Plan is a paper exercise. Both parties know the outcome is predetermined.

Gorgias reports something different. In their model, approximately 50% of employees placed on a structured improvement plan return to the expected performance standard when:

The plan is designed with genuine specificity, not generic targets
Managerial investment is real, with regular coaching conversations
The timeframe is honest and clearly communicated
The criteria for success and failure are defined upfront, not revised after the fact

The other 50%, who do not recover, are exited, but the attempt was real. The difference is not in the paperwork. It is in the intent.

Gorgias also runs bi-yearly cross-functional performance reviews, where each person's contributions are evaluated by a panel that includes people from outside their direct team. This reduces individual manager bias, surfaces contributions that might be invisible to a single manager, and creates a richer, more honest picture of actual performance.

What comes next

We have now covered what talent density is, what it costs when it is low, how to hire for it, and how to manage and evaluate for it. The last piece is the one that holds all the others together.

In Blog 5, we address the question founders ask us most often: can I maintain genuinely high standards and still have a culture where people feel safe, take risks, and bring their best ideas? Or do I have to choose?

Up Next in This Series

Psychological Safety to High Performance

We close the series with the framework that makes everything else sustainable.