The curse of the p-value

For decades, the quality of statistical results has been measured using the p-value. But this gold standard is not as reliable as you might think. It’s time to question it.

Humanicus
Nov 28, 2021 · 9 min read

For a brief moment, Matt Motyl stood on the brink of scientific glory. In 2010, he discovered in an experiment that extremists saw the world in black and white, literally. His results were “crystal clear,” recalls the then doctoral student in psychology at the University of Virginia in Charlottesville. His study of 2,000 participants had shown that political extremists on the left and the right had a harder time distinguishing shades of gray than more moderate people. Motyl was confident in the strength of his results, and publication in a renowned journal seemed within reach. But nothing went as planned.

In fact, the strength of the statistical result had been measured with the ritual index of all statisticians, the indispensable p-value. Here it came out at 0.01, indicating a “very significant” result. Everything seemed to be going well. As a precaution, though, Motyl and his supervisor, Brian Nosek, repeated the experiment. With the new data, the p-value jumped to 0.59, far above the 0.05 threshold below which a result is considered significant. The effect had disappeared, and with it Motyl’s dream of glory. The p-value test condemned the study to oblivion. But what if this was a miscarriage of justice? What if the p-value is not as reliable as you might think?

The p-value, this mosquito

In fact, the extremist study did not suffer from a data-collection problem or a miscalculation. The fault lay with the deceptive nature of the p-value itself, which is not as reliable and objective as most scientists think. Stephen Ziliak, an economist at Roosevelt University in Chicago and a regular critic of how statistics are used, goes one step further. According to him, “p-values don’t do their job, because they can’t.”

This is worrying, particularly given the concerns about reproducibility running through the scientific community, which Motyl’s case illustrates. John Ioannidis, an epidemiologist at Stanford University, argued in 2005 that the majority of published results are false and put forward explanations for this phenomenon. Since then, replication has failed in many famous cases, forcing scientists to reconsider their methods. In other words, is the way results are evaluated reliable? Are there alternatives, that is, methods that catch false alarms without obscuring real effects?

Criticism of the p-value is nothing new. Since its formalization by the British mathematician Karl Pearson at the beginning of the twentieth century, it has been denigrated by comparing it to a “mosquito” (annoying and impossible to get rid of) or to “the emperor’s new clothes” (there is a problem, but nobody talks about it). Charles Lambdin, of Intel Corporation, even proposed renaming the method “Statistical Hypothesis Inference Testing”, no doubt for the acronym formed by its initials.

Ironically, when Ronald Fisher (1890–1962), one of the fathers of modern statistics, introduced the p-value in the 1920s, he did not intend it as a test that would settle anything definitively. He saw it more as a way to judge whether a result was significant, with this term taken in its everyday sense: the result means something, enough for us to look at it more closely. His idea was to conduct an experiment and then check whether the data were consistent with what chance alone would have produced.

To do this, a researcher first formulates a so-called null hypothesis expressing the reverse of what he wants to prove, for example that individuals who received a drug are not in better health than those who received a placebo. The researcher then assumes that the null hypothesis is true and calculates, under this assumption, the probability of obtaining results at least as marked as those actually observed. This probability is the p-value. The smaller it is, Fisher reasoned, the greater the likelihood that the null hypothesis is false and must be rejected, which is what the researcher wanted all along, since it supports the effect under study.
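As a minimal sketch of that procedure: the drug and placebo scores below are simulated, and the two-sample t-test is just one of many tests that could be used to compute such a p-value; nothing here reproduces the analyses actually discussed in this article.

```python
# Sketch: a p-value for a hypothetical drug-vs-placebo comparison.
# The data and the choice of a two-sample t-test are illustrative assumptions.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
placebo = rng.normal(loc=50.0, scale=10.0, size=100)  # simulated health scores
drug = rng.normal(loc=53.0, scale=10.0, size=100)     # simulated, slightly higher mean

# Null hypothesis: the drug group is no better than the placebo group.
# The p-value is the probability, under that null, of seeing a difference
# at least as marked as the one actually observed.
t_stat, p_value = ttest_ind(drug, placebo)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```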

We can calculate the p-value to as many decimal places as we like, but this apparent precision is misleading, because the hypotheses being tested are nowhere near that precise. For Fisher, the p-value was just one link in a chain connecting observations and background knowledge to a scientific conclusion.

These precautions were swept aside by a movement aimed at making evidence-based decision-making as rigorous and objective as possible. The pioneers of this turnaround in the late 1920s were the Polish mathematician Jerzy Neyman and the British statistician Egon Pearson, Fisher’s most bitter rivals. Their system includes concepts like “false positive,” “false negative,” and the “power” of a statistical test, which can be found in every beginner statistics course today. Neyman and Pearson, however, deliberately set the p-value aside.

While representatives of the two camps were at odds, other researchers lost patience and wrote statistical textbooks themselves. Unfortunately, many of these writers were not sufficiently versed in the subject to appreciate the philosophical subtleties of the two visions. They therefore integrated Fisher’s p-value, easy to calculate, into the regulated, rigorous and seemingly secure system of Neyman and Pearson. The p-value has sat on its pedestal ever since, and the value of 0.05 became the threshold between what is significant and what is not. But that was never its role.

Hangover and tumor

As a result, there is great confusion today about what the p-value actually expresses. Motyl’s study is a good illustration. Most scientists would interpret his first p-value of 0.01 as a 1% probability of a false alarm. That, however, is wrong. The p-value only gives a rough summary of the data under the assumption that a specific null hypothesis is true.

Another decisive piece of information is missing: the probability that the effect in question actually exists. Leaving it aside is like waking up in the morning with a migraine and blaming a rare brain tumor: possible, but rather unlikely, since much more evidence is needed to rule out the more common explanations. The less plausible the hypothesis and the more exciting the result, the more likely a false alarm becomes, entirely independently of the p-value.

For the practitioner, such a statement is difficult to pin down. How should he assess the probability that the effect exists? Is it the relative frequency with which it would be observed across a set of studies (the frequentist interpretation)? Or does such a number reflect the researcher’s partial state of knowledge, to be updated as evidence comes in (the Bayesian interpretation)? Either way, before an experiment a researcher can roughly estimate the probability that the hypothesis under study is true.

By setting this probability at the start, and by calculating, with judicious additional assumptions, how p-values behave under these conditions, the results become much less spectacular. With a prior probability of 50% attributed to Motyl’s hypothesis, the p-value of 0.01 then means that in about one case in nine, his experiment would suggest an effect that does not exist at all.
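The article does not say which “judicious additional assumptions” produce this one-in-nine figure. One common calculation that reproduces the same order of magnitude, and which is assumed here purely for illustration, is the minimum Bayes factor bound of Sellke, Bayarri and Berger, −e·p·ln(p):

```python
# Sketch: turning a p-value into a false-alarm probability, assuming the
# Sellke-Bayarri-Berger bound on the Bayes factor. This is an assumption used
# to illustrate the order of magnitude, not the article's own calculation.
import math

def false_alarm_probability(p, prior_true=0.5):
    """Lower bound on P(null is true | data), given a p-value and a prior
    probability that the studied effect is real."""
    bf_null = -math.e * p * math.log(p)          # minimum Bayes factor for the null (valid for p < 1/e)
    prior_odds_null = (1 - prior_true) / prior_true
    posterior_odds_null = bf_null * prior_odds_null
    return posterior_odds_null / (1 + posterior_odds_null)

print(false_alarm_probability(0.01))  # ~0.11, i.e. roughly one case in nine
print(false_alarm_probability(0.05))  # ~0.29 with the same 50% prior
```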

Moreover, the probability that his colleagues could reproduce his experiment is not 99%, as one might think, but closer to 73%, or even only 50% if one again demands a p-value of 0.01. In other words, the fact that his second experiment was inconclusive is about as surprising as losing a coin toss.

The effect of effect size

Many critics also feel that the p-value is detrimental to reasoning, particularly because it distracts attention from the actual size of the effect. Let’s take an example. In 2013, a study by John Cacioppo of the University of Chicago, covering more than 19,000 participants, showed that marriages born from online dating were more robust (p < 0.002) than others. In addition, individuals who were still married tended to be more satisfied with their marriage than those who had met offline (p < 0.001). These values are impressive, but the measured effect was tiny: meeting online lowered the separation rate from 7.67% to 5.96% and increased marital satisfaction from 5.48 to 5.64 on a 7-point scale.
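A minimal sketch of why a tiny effect can still carry an impressive p-value when the sample is large. The equal group split below is an assumption for illustration, not Cacioppo’s actual data:

```python
# Sketch: large samples make tiny effects "highly significant".
# The group sizes are hypothetical; only the reported percentages come from the article.
from scipy.stats import chi2_contingency

n_online, n_offline = 9_500, 9_500        # assumed split of ~19,000 participants
sep_online = round(0.0596 * n_online)     # 5.96% separation rate (online meetings)
sep_offline = round(0.0767 * n_offline)   # 7.67% separation rate (offline meetings)

table = [[sep_online, n_online - sep_online],
         [sep_offline, n_offline - sep_offline]]
chi2, p, dof, _ = chi2_contingency(table)
print(f"p = {p:.2g}")  # a very small p-value, yet the absolute difference is under 2 points
```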

Another error, arguably the most serious, lies behind what psychologist Uri Simonsohn of the University of Pennsylvania calls “p-hacking”. This bias consists of manipulating the data until the desired result is obtained, even without bad intentions. Thus, a small change in method during data analysis can increase the false positive rate of a study to 60%.

It is difficult to assess how widespread the problem is, but according to Simonsohn, p-hacking is on the rise, because it has become common to look for very small effects in “noisy” data. Analyzing studies in psychology, he found that suspiciously many published p-values clustered around 0.05. Small wonder: researchers go fishing for significant p-values until one that interests them falls into the net.
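A hedged sketch of the mechanism, not Simonsohn’s own analysis: simulate data in which no effect exists, let the analyst try several reasonable-looking analyses, and report whichever one “works”. The four analysis choices below are invented for illustration, and the resulting inflation depends entirely on them:

```python
# Sketch: how trying several analyses on null data inflates false positives.
# The "researcher degrees of freedom" are illustrative assumptions.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)
n_sims, n_per_group = 5_000, 20
hits = 0

for _ in range(n_sims):
    a = rng.normal(size=n_per_group)   # two groups, no true difference
    b = rng.normal(size=n_per_group)
    p_values = [
        ttest_ind(a, b).pvalue,                # analysis 1: full sample
        ttest_ind(a[:15], b[:15]).pvalue,      # analysis 2: an "early look" at the data
        ttest_ind(a[a > a.min()], b).pvalue,   # analysis 3: drop one "outlier"
        ttest_ind(a, b[b < b.max()]).pvalue,   # analysis 4: drop another "outlier"
    ]
    if min(p_values) < 0.05:                   # report whichever analysis came out significant
        hits += 1

print(f"False positive rate with cherry-picking: {hits / n_sims:.1%}")  # noticeably above the nominal 5%
```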

With all of this criticism, have things changed? Little. John Campbell, now a psychology researcher at the University of Minnesota in Minneapolis, was already lamenting this in 1982, when he was editor of the Journal of Applied Psychology: “It is practically impossible to tear authors away from their p-values. And the more zeros after the decimal point, the more they cling to them.”

Every attempt at reform will have to fight firmly established habits: the way statistics are taught in universities, the way study results are used and interpreted, and the way they are then reported in specialist journals. But at least many scientists have admitted there is a problem. Thanks to researchers like John Ioannidis, the concerns of statisticians are no longer dismissed as pure theory.

To improve the situation, statisticians have a few tools at their disposal. For example, researchers could systematically publish the effect sizes and confidence intervals they obtain. These values express what the p-value alone cannot: the magnitude and the relative importance of the effect.
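A minimal sketch of what such reporting might look like, using simulated group data; Cohen’s d and a normal-approximation confidence interval are one common choice among several, assumed here for illustration:

```python
# Sketch: reporting an effect size and a confidence interval next to the p-value.
# The data are simulated; Cohen's d and the normal-approximation CI are assumed conventions.
import numpy as np
from scipy.stats import ttest_ind, norm

rng = np.random.default_rng(1)
group_a = rng.normal(5.64, 1.0, 250)   # simulated satisfaction scores
group_b = rng.normal(5.48, 1.0, 250)

t_stat, p_value = ttest_ind(group_a, group_b)

diff = group_a.mean() - group_b.mean()
pooled_sd = np.sqrt((group_a.var(ddof=1) + group_b.var(ddof=1)) / 2)
cohens_d = diff / pooled_sd            # standardized effect size

se = np.sqrt(group_a.var(ddof=1) / len(group_a) + group_b.var(ddof=1) / len(group_b))
ci_low, ci_high = diff - norm.ppf(0.975) * se, diff + norm.ppf(0.975) * se

print(f"p = {p_value:.3f}, difference = {diff:.2f} "
      f"(95% CI {ci_low:.2f} to {ci_high:.2f}), d = {cohens_d:.2f}")
```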

Many also argue for replacing the p-value with methods based on the ideas of the 18th-century British mathematician Thomas Bayes. This approach treats a probability as the plausibility of an outcome rather than its potential long-run frequency. It certainly introduces an element of subjectivity into statistics, which the pioneers of the early twentieth century wanted to avoid at all costs. Yet Bayesian statistics make it easier to bring in contextual knowledge about the world and allow us to calculate how probabilities change as new evidence is acquired.
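A minimal sketch of that updating step, assuming a simple proportion with a Beta prior, a standard textbook setup rather than anything specific to the studies discussed here:

```python
# Sketch: Bayesian updating of a proportion with a Beta prior.
# The prior and the data are illustrative assumptions.
from scipy.stats import beta

# Prior belief about an effect rate, encoded as Beta(2, 2): centred on 0.5, fairly vague.
prior_a, prior_b = 2, 2

# New evidence: 14 "successes" out of 20 trials.
successes, trials = 14, 20

# With a Beta prior and binomial data, the posterior is again a Beta distribution.
post_a, post_b = prior_a + successes, prior_b + (trials - successes)
posterior = beta(post_a, post_b)

print(f"Posterior mean: {posterior.mean():.2f}")
print(f"95% credible interval: {posterior.ppf(0.025):.2f} to {posterior.ppf(0.975):.2f}")
```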

Others take a more pluralistic approach: scientists should apply more than one method to the same data set. When the results differ, researchers will have to dig deeper to work out why, and the understanding of the underlying reality can only benefit.

In Simonsohn’s eyes, the best protection for a researcher is to show everything. Authors should certify their work “p-hacking free” by explaining how they chose their sample sizes, which data were set aside, and which manipulations were carried out. Today, such information is rarely reported, let alone verifiable.

Studies in two acts

Like the statement “The authors have no financial interests related to the content of this article,” which is standard today, such a declaration would help distinguish unintentional error from wilful scientific misconduct. Once it becomes commonplace, p-hacking should recede; at the very least, the absence of the declaration will be noticed by readers and allow them a more informed judgment.

Two-step analysis, or “preregistered replication,” is one idea working in this direction, and it is gaining ground. In this approach, exploratory and confirmatory studies are treated differently and clearly labeled as such. Instead of running four small experiments and bundling the results into one article, for example, researchers would first sweep the field for interesting observations with two small exploratory studies, without worrying too much about false alarms. Only afterwards, on the basis of these data, would they design a study meant to confirm the findings, and they would preregister their intentions in a public database such as the Open Science Framework. They would then publish the confirmatory results alongside those of the exploratory studies in a regular journal article. Such an approach leaves a lot of freedom, explains political scientist and statistician Andrew Gelman of Columbia University in New York, but it is also rigorous enough to reduce the number of false discoveries.

More generally, the time has come for scientists to recognize the limits of conventional statistics. Above all, a serious scientific estimate of the plausibility of a result should be made as soon as it is analyzed: what have similar studies found? Is there a mechanism that could explain the effect? Do the results square with clinical experience? These are the decisive questions. By answering them in future work, Motyl can still hope to achieve the success he hopes for.
