The World's First Performance Review Was a Disaster
In ancient Babylon, around 1750 BCE, King Hammurabi instituted what might be history's first formal performance evaluation system. Government officials were rated on a scale from "excellent" to "should be thrown to the lions" (I'm paraphrasing, but not by much). The evaluations were supposed to be based on objective criteria: tax collection efficiency, public works completion, crime reduction in their districts.
It lasted exactly three years before collapsing into the same mess that modern HR departments still can't solve: favoritism, recency bias, and the fundamental human inability to separate "I like this person" from "this person is good at their job."
The clay tablets recording these evaluations still exist. They're hilarious. Officials who increased tax revenue by 300% got marked down for "insufficient respect for authority." Administrators who reduced crime to near zero were dinged for "failure to maintain proper social relationships with superiors." The guy who built a bridge that lasted 400 years got a mediocre review because he "seemed arrogant during meetings."
Sound familiar?
Ancient Rome Tried to Hack Human Judgment
The Romans, being Romans, decided they could engineer their way out of human bias. They created elaborate scoring systems for military promotions, with points assigned for specific achievements: enemies killed, territory conquered, years of service, recommendations from superiors.
It was a disaster of epic proportions.
Within a decade, the system was completely gamed. Officers started inflating enemy casualty counts, claiming credit for other people's victories, and forming mutual admiration societies where they wrote glowing recommendations for each other. The objective criteria were manipulated so thoroughly that the Roman Senate eventually threw out the entire system and went back to subjective judgment calls.
Which were also terrible, but at least honestly terrible.
The problem wasn't the metrics. The problem was that humans were still doing the measuring, and humans come with built-in cognitive bugs that five thousand years of civilization haven't patched.
Medieval Guilds Thought They Had the Answer
By the Middle Ages, European craft guilds had developed what looked like the perfect solution: peer review. Master craftsmen would evaluate apprentices and journeymen based on the quality of their work, assessed by people who actually understood the craft.
Finally, a system where experts judged expertise. What could go wrong?
Everything, as it turned out.
Guild records from across Europe tell the same story: evaluations consistently favored the sons and relatives of existing masters, regardless of skill level. Innovative techniques were marked down as "failure to follow established methods." Craftsmen who threatened the economic interests of their evaluators somehow always scored poorly on "attitude" and "guild loyalty."
The most skilled metalworker in 14th-century Florence was repeatedly denied master status because his techniques made everyone else's work look outdated. His crime wasn't incompetence — it was competence that made his judges uncomfortable.
This is why your most innovative coworkers often get mediocre performance reviews while the office politicians get promoted. It's not a bug in the system. It's a feature of human psychology that's been running the same code for millennia.
The Cognitive Bugs Are Hardwired
Modern behavioral psychology has names for the mental shortcuts that make fair evaluation nearly impossible: the halo effect, confirmation bias, attribution error, recency bias, similarity bias, and about twenty others. But ancient administrators were dealing with these exact same glitches without knowing what to call them.
Egyptian papyri from the New Kingdom period show that scribes consistently rated other scribes based on how similar their handwriting was to their own. Roman centurions gave higher marks to soldiers who came from the same regions they did. Chinese imperial bureaucrats during the Tang dynasty favored candidates whose educational backgrounds matched their own.
These weren't conscious prejudices. They were the product of cognitive shortcuts that helped early humans survive in small tribes but turn into liabilities when we try to evaluate strangers fairly in complex organizations.
Your brain is still running tribal software in a corporate environment. When your manager rates your "communication skills," they're not measuring your actual ability to communicate — they're measuring how much your communication style matches what feels familiar and comfortable to them.
Every performance review in history has been contaminated by this mismatch between our hardware and our environment.
Why Technology Can't Fix Biology
Modern companies have tried to solve this with software: 360-degree feedback systems, AI-powered performance analytics, peer review platforms, objective key performance indicators. The results are depressingly familiar to anyone who's studied ancient evaluation systems.
The metrics get gamed. The feedback gets politicized. The AI learns to replicate human biases at scale. The objective indicators turn out to measure everything except what actually matters.
A 2019 study found that employees rated as "high performers" by modern evaluation systems were no more likely to deliver measurable business results than employees rated as "average." The correlation between review scores and actual outcomes was essentially zero.
This isn't because modern HR departments are incompetent. It's because they're trying to use human judgment to measure human performance, and human judgment has been broken in predictable ways since before we invented writing.
The Uncomfortable Truth
Here's what five thousand years of performance evaluation data tells us: humans are fundamentally bad at judging other humans fairly. We're biased toward people who remind us of ourselves, biased against people who threaten our status, and biased toward whatever happened most recently.
These biases served us well when we lived in groups of 50 people and everyone knew everyone else's actual capabilities through direct observation. They're disasters in organizations large enough to require formal evaluation systems.
Every attempt to create fair, objective performance measurement has failed for the same reasons: the people doing the measuring are still human, and humans come with cognitive equipment designed for a world that no longer exists.
The ancient Babylonians figured this out 4,000 years ago, which is why King Hammurabi's evaluation system was quietly abandoned after three years. Modern companies haven't figured this out yet, which is why they keep spending billions of dollars on performance management software that produces the same biased results as clay tablets.
What Actually Works
The most successful organizations in history didn't try to eliminate bias from human judgment. They acknowledged it and designed systems that worked around it rather than pretending it didn't exist.
Roman military units rotated commanders regularly to prevent personal relationships from contaminating evaluations. Medieval guilds required multiple masters to sign off on promotions, making it harder for any individual bias to dominate. Chinese imperial bureaucrats used anonymous examinations where evaluators couldn't know whose work they were judging.
None of these systems were perfect, but they were honest about the limitations of human judgment and built safeguards accordingly.
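If you wanted to translate those ancient safeguards into modern terms, a deliberately toy sketch might look like the Python below. Everything in it is made up for illustration: reviewers are modeled as scoring functions with a built-in personal bias, panels rotate randomly so nobody gets attached to anybody, and the median of several sign-offs keeps any single skewed judge from deciding the outcome.

```python
import random
import statistics

# Toy model: each piece of work has a true quality; each reviewer
# adds a personal bias plus some noise. The three ancient safeguards:
# anonymized work, rotating panels, and multiple sign-offs.

def make_reviewer(bias, noise=1.0):
    """A reviewer is a scoring function with a built-in cognitive bug."""
    def score(true_quality):
        return true_quality + bias + random.gauss(0, noise)
    return score

def panel_score(true_quality, reviewers, panel_size=3):
    """Rotation: draw a fresh random panel for each piece of work.
    Multiple sign-offs: take the median, so one skewed score
    can't dominate the result."""
    panel = random.sample(reviewers, panel_size)
    return statistics.median(r(true_quality) for r in panel)

random.seed(42)
reviewers = [make_reviewer(b) for b in (-3, -1, 0, 0.5, 2, 4)]
quality = 7.0

print(f"one attached judge: {reviewers[0](quality):.1f}")
print(f"rotated anonymous panel: {panel_score(quality, reviewers):.1f}")
```

The point isn't the code; it's the shape of the design. Rotation, anonymity, and median aggregation don't make any individual judge less biased. They just make it harder for one person's bugs to decide the outcome, which is exactly what the Romans, the guilds, and the Tang bureaucrats figured out without a single line of Python.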
Modern companies that accept this reality and focus on minimizing bias rather than eliminating it tend to make better decisions about people. The ones that keep believing they can engineer perfect objectivity keep getting the same results that frustrated ancient kings and guild masters.
The hardware hasn't changed. The bugs are still there. The only question is whether we're honest enough to admit it and design around our limitations, or whether we'll keep pretending that this time will be different.
Spoiler alert: it won't be.