Law Enforcement

Beyond Population Percentages: How to Measure Fairness in Policing

January 2026

Written by Tanaya Devi
Co-Founder & Chief Data Scientist, Sigma Squared

It's annual report season – will yours tell the full story? 

Debates on fairness in policing often start with a simple statistic: only X% of a city’s population is a given demographic, but Y% of police stops involve those individuals. These raw comparisons make for striking headlines and soundbites. It feels intuitive that when a group is stopped more often than its share of the population, something is wrong. Indeed, in one analysis of traffic stops in California in 2022, Black people accounted for 13% of traffic stops while being only 5% of the state’s population. Such figures seem to cry out “bias.” Because it is so simple and easy to grasp, this population-vs-stops statistic is often the one community members and the media latch onto when interpreting fairness in policing.

As attention-grabbing as these numbers are, they paint an incomplete – and often misleading – picture. If we truly want fair policing, we have to start by measuring interactions correctly. That means moving beyond surface-level disparities and using rigorous, nuanced metrics that account for context and behavior. Simply put: “X% of population vs Y% of stops” is too crude a yardstick for justice. Below, I outline why this popular metric falls short and how law enforcement leaders can adopt more sophisticated tools – controlled disparities, outcome tests, and threshold tests – to get an economist’s read on bias and guide meaningful reforms.

The ‘Yardstick’ Problem

The big problem with “X% of the population, Y% of stops” is that it uses the wrong comparison group.

Population-based figures assume everyone is equally likely to be on the road, in the same places, at the same times, and engaging in the same behaviors that trigger stops. That’s not reality. Driving patterns vary by neighborhood, commute routes, and work schedules. If a group is 20% of the city’s residents but makes up 40% of the drivers on a particular highway – or 40% of the people speeding there – you would expect that group to account for roughly 40% of stops on that stretch of road. 
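To make the arithmetic concrete, here is a minimal sketch in Python, using made-up numbers that mirror the hypothetical example above, of how the same stop share looks very different depending on which benchmark it is divided by.

# Illustrative only: numbers mirror the hypothetical example in the text.
group_share_of_residents = 0.20   # 20% of the city's residents
group_share_of_drivers = 0.40     # 40% of drivers on this stretch of highway
group_share_of_stops = 0.40       # 40% of stops on the same stretch

# Disparity ratio against the residential population (the crude yardstick)
ratio_vs_population = group_share_of_stops / group_share_of_residents  # 2.0

# Disparity ratio against the at-risk (driving) population
ratio_vs_drivers = group_share_of_stops / group_share_of_drivers       # 1.0

print(f"vs. residents: {ratio_vs_population:.1f}x")
print(f"vs. drivers:   {ratio_vs_drivers:.1f}x")

Against the residential benchmark the group looks stopped at twice its “expected” rate; against the driving benchmark the disparity disappears. Same stops, different yardstick, opposite conclusion.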

So, the right question is: stops compared to whom? Not the local population, but the people who were actually “in the pool” to be stopped in that specific place and moment – the at-risk population.

Researchers have long warned that residential populations are a poor denominator for stop analysis. Raw stop shares mix two very different things into one number: exposure (who is present in a given location at a given time) and enforcement decisions (how officers choose to act). Moreover, we typically observe only the violations that resulted in a stop, not the full set of violations that happened but were never enforced. Without that missing baseline, it’s easy to mistake differences in exposure or behavior for differences in treatment.

For police leaders, the takeaway is straightforward: if you want to measure fairness, start with the right yardstick. Population percentages are a blunt instrument. Benchmarks grounded in who was actually on the road, what was happening there, and how enforcement was deployed are what make fairness measurement credible – and actionable.

Controlled Disparities: Comparing Like with Like

Choosing the right benchmark is necessary, but not sufficient. Raw average differences in search rates can hide the fact that groups are often stopped in very different contexts. Suppose Group A has a higher search rate than Group B across all stops. It’s tempting to read that as bias against Group A – but it could simply reflect where stops happen. If Group A is mostly stopped in a high-crime area where searches are legitimately more common, the raw gap mixes location effects with treatment differences. A controlled comparison instead asks: within the same neighborhood (and similar circumstances), are A and B searched at different rates? That “like with like” view separates context from differential treatment.
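As a rough illustration of what “controlling” means in practice, here is a hedged sketch in Python: a logistic regression of the search decision on a group indicator plus controls for neighborhood, time of day, and stop reason. The data file and column names are hypothetical assumptions, and real analyses use richer controls and more careful specifications.

# Hedged sketch of a controlled disparity analysis. The file and column names
# (searched, group, neighborhood, time_of_day, stop_reason) are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

stops = pd.read_csv("stops.csv")

# Raw gap: search rates by group, pooling every context together
print(stops.groupby("group")["searched"].mean())

# Controlled gap: the coefficient on group now reflects the difference in
# search odds between groups within the same neighborhood, time of day, and
# stop reason, rather than across them.
model = smf.logit(
    "searched ~ C(group) + C(neighborhood) + C(time_of_day) + C(stop_reason)",
    data=stops,
).fit()
print(model.summary())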

When researchers have applied controls, they often find that disparity narrows, but does not disappear. In his analysis of police use of force, Harvard economist and Sigma Squared Co-Founder Roland Fryer found that adding controls for context and civilian behavior reduced the racial disparities in non-lethal force, though it did not fully explain them. This kind of controlled analysis is far more informative than a blunt population ratio. It helps police leaders answer the question: “are officers using force (or making searches) more often on minority civilians even when the situation is comparable to those involving white civilians?”

The takeaway for police leaders and communities is to dig deeper by analyzing controlled disparities. That approach helps separate outcome gaps driven by differences in policing from gaps driven by differences in the situations officers encounter on the ground. 

Beyond the Surface: Outcome and Threshold Tests

Most importantly, even a controlled disparity doesn’t, by itself, prove bias. To assess bias more directly, we have to rely on models and tests social scientists have developed – such as outcome tests and threshold tests – which probe whether standards and results differ across groups, not just the rate at which actions occur.

Outcome tests ask: Are the outcomes of searches or stops different across groups? In the classic formulation, if officers are searching drivers of different races with the same threshold of suspicion, we would expect those officers to find contraband at roughly equal rates for each group (known as a “hit rate”). If searches of minority drivers yield contraband at a much lower rate than searches of white drivers, it suggests officers might be over-searching minorities on weaker suspicions. Conversely, if a minority group’s search yield is higher than another’s, it could mean officers are setting a higher bar before searching that group (or that they are under-searching them relative to actual offense rates).

A seminal study of highway searches hit upon this idea: economists found that although Black drivers were searched at a much higher rate, the percentage of searches yielding drugs or other contraband was virtually the same for Black and white drivers. This equal “hit rate” led the researchers to conclude that, given the assumptions of their model, the disparity in search rates was not driven by racial prejudice. In theory, a non-discriminatory police force focused purely on efficiency would allocate searches so that success rates equalize across groups. Outcome analyses thus shift the focus from “who gets searched” to “what happens when they are searched,” which can be a more telling indicator of bias in officer decision-making.
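For a sense of what an outcome test looks like in code, here is a minimal sketch that computes hit rates by group and runs a simple two-proportion comparison. The data file and column names are illustrative assumptions, not a reference to any particular dataset.

# Minimal outcome ("hit rate") test sketch. File and column names are hypothetical:
# one row per search, with columns group and found_contraband (0/1).
import pandas as pd
from statsmodels.stats.proportion import proportions_ztest

searches = pd.read_csv("searches.csv")

# Hit rate = share of searches that actually turned up contraband, by group
hit_rates = searches.groupby("group")["found_contraband"].agg(["mean", "count"])
print(hit_rates)

# Simple two-group comparison: is the gap in hit rates statistically distinguishable?
groups = list(hit_rates.index)[:2]
successes = [int(searches.loc[searches["group"] == g, "found_contraband"].sum()) for g in groups]
totals = [int(hit_rates.loc[g, "count"]) for g in groups]
stat, pvalue = proportions_ztest(successes, totals)
print(f"z = {stat:.2f}, p = {pvalue:.3f}")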

Outcome tests help us better understand decision-making bias. But there’s an important limitation called infra-marginality: hit rates can differ even when officers apply the same search standard to everyone.

Here’s a simple numeric example. Suppose officers search anyone they believe has at least a 30% chance of carrying contraband.

  • Group A is a mix of two types of cases: half are 0% risk (almost certainly nothing), and half are 100% risk (almost certainly contraband).

  • Group B is mostly 60% risk cases.

Now apply the same 30% threshold:

  • In Group A, officers will only search the 100% risk half (because 0% is below 30%). That produces a 100% hit rate.

  • In Group B, officers will search essentially everyone (60% is above 30%), but searches succeed only 60% of the time – so the hit rate is 60%.

Notice what happened: officers used the same 30% threshold for both groups, yet Group A’s hit rate looks “better” than Group B’s. That difference doesn’t require bias – it comes from the fact that the underlying mix of risk levels is different across groups.

That’s infra-marginality in action: hit rates reflect both officer decisions and the distribution of risk in the population being searched. So a hit-rate gap can’t, by itself, tell you whether officers are applying different standards.
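The same arithmetic is easy to verify in a few lines of code. This sketch reproduces the numbers from the example above: one 30% threshold applied to both groups, two very different hit rates, no bias anywhere.

# Reproducing the infra-marginality example: same 30% threshold, different hit rates.
THRESHOLD = 0.30

# Each group is a list of (risk level, share of cases at that risk level)
group_a = [(0.0, 0.5), (1.0, 0.5)]  # half near-certain innocent, half near-certain carrying
group_b = [(0.6, 1.0)]              # essentially all 60%-risk cases

def hit_rate(group, threshold):
    # Expected hit rate if officers search every case at or above the threshold
    searched = [(risk, share) for risk, share in group if risk >= threshold]
    share_searched = sum(share for _, share in searched)
    expected_hits = sum(risk * share for risk, share in searched)
    return expected_hits / share_searched

print(f"Group A hit rate: {hit_rate(group_a, THRESHOLD):.0%}")  # 100%
print(f"Group B hit rate: {hit_rate(group_b, THRESHOLD):.0%}")  # 60%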

Scientists developed threshold tests to address this problem. Threshold tests infer the actual threshold of suspicion officers are using for each group. Instead of looking only at averages, the threshold test uses statistical modeling to estimate how likely a stopped individual was to be carrying contraband, and whether the probability cutoff that triggers a search differs by race. In practical terms, a threshold test asks: “At what estimated probability of finding contraband does an officer decide to search, and is that cutoff lower for, say, Black drivers than for white drivers?” When applied to 4.5 million police stops in North Carolina, the threshold test revealed something important: the traditional outcome test does not account for infra-marginality and can lead to misleading conclusions.
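The published threshold test is a Bayesian hierarchical model and well beyond a short snippet, but a heavily simplified proxy can convey the intuition: estimate each stopped driver’s probability of carrying contraband from the circumstances of the stop, then look at how low that estimated risk goes among the stops officers actually chose to search in each group. Everything below – the data file, the column names, and the use of a low quantile as a stand-in for the inferred threshold – is an illustrative assumption, not the published method.

# Heavily simplified proxy for the threshold-test intuition. This is NOT the
# published Bayesian threshold test; file and column names are hypothetical.
import pandas as pd
from sklearn.linear_model import LogisticRegression

stops = pd.read_csv("stops_with_searches.csv")
searched = stops[stops["searched"] == 1].copy()

# Step 1: estimate contraband risk from observable stop features, fit on
# searched stops (the only ones where the outcome is observed).
features = pd.get_dummies(searched[["stop_reason", "time_of_day", "location"]])
risk_model = LogisticRegression(max_iter=1000).fit(features, searched["found_contraband"])
searched["estimated_risk"] = risk_model.predict_proba(features)[:, 1]

# Step 2: as a rough stand-in for each group's search threshold, look at a low
# quantile of estimated risk among searched stops. A systematically lower value
# for one group is consistent with a lower bar for searching that group.
print(searched.groupby("group")["estimated_risk"].quantile(0.10))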

Police leaders risk being at the mercy of blunt demographic analysis if they are not conversant with these tools. Those who understand them will not settle for raw percentages when their monthly or annual stop data comes in. They’ll ask: “What were the outcomes? Were our hit rates equal? Could those outcomes be hiding bias? What threshold of evidence are my officers using for each group?” These questions go much further toward diagnosing bias (or demonstrating fairness) than any basic disparity ratio can. They also support more nuanced and productive community conversations about where bias is – and is not – present.

Conclusion: Measure it Right, Make it Fair

Better measurement isn’t academic – it’s operational. When departments analyze disparities with the right tools, they can pinpoint where gaps are coming from and what to do about them. If controlled comparisons show disparities concentrated in a specific unit, beat, or discretionary traffic enforcement, command staff can focus supervision, training, and policy changes exactly where they matter most. If outcome or threshold tests suggest different standards of suspicion across groups, departments can tighten search guidance, require clearer documentation, and add targeted review for outlier patterns. Done well, measurement turns a vague debate into a practical, evidence-based roadmap – clear leverage points, specific interventions, and a way to track whether changes are actually working.

Fair policing starts with measuring fairness correctly. Population-based comparisons make for easy headlines, but they’re too blunt to diagnose officer decision-making or guide reform. The good news is that better tools exist: methods that control for context, examine outcomes, and estimate decision thresholds so departments and communities can understand what’s really happening. And that clarity matters: if you measure the problem poorly, you risk fixing the wrong thing. The chiefs who deploy effective measurement strategies can lead with credibility, act with precision, and build public trust with transparency. In policing, what gets measured correctly gets managed properly.

Heading to MCCA? Let’s talk fairness in public safety.

Our Co-Founder & Chief Data Scientist, along with members of our executive team, will be on-site throughout the conference to meet with agency leaders and discuss how to evaluate and measure fairness in policing.

👉 Book Time at MCCA

