
11.7. Criticisms of Null Hypothesis Testing

By Rajiv S. Jhangiani, I-Chant A. Chiang, Carrie Cuttler and Dana C. Leighton, adapted by Marc Chao and Muhamad Alif Bin Ibrahim


Null hypothesis testing is a cornerstone of modern research methodology, yet it has faced increasing scrutiny due to its conceptual and practical limitations. These criticisms range from widespread misunderstandings of key concepts, such as the p-value, to fundamental concerns about the method’s logic and utility in drawing meaningful conclusions. Despite its shortcomings, null hypothesis testing continues to be defended and widely used, while alternative approaches are gaining traction in research communities.

Misinterpretations of the p-Value

One of the most pervasive issues with null hypothesis testing is the misinterpretation of the p-value. Many researchers erroneously believe that the p-value represents the probability that the null hypothesis is true. In reality, the p-value is the probability of obtaining a sample result at least as extreme as the one observed, assuming the null hypothesis is true. This misunderstanding often leads to overconfidence in research findings and misinformed conclusions.

A related misconception is that 1 − p equals the probability of successfully replicating a significant result. For example, Oakes (1986) found that 60% of professional researchers incorrectly believed that a p-value of .01 in an independent-samples t-test (with 20 participants per group) implied a 99% chance of replication. This is far from accurate. Statistical power, the probability of detecting a true effect of a given size, shows that even with a large population effect, replicating a result with 99% probability requires considerably larger samples than are typically used. For instance, achieving 80% power for a large effect size (d = 0.8) requires 26 participants per group, while 99% power demands 59 participants per group. These figures highlight how reliance on p-values alone can create misleading expectations about replicability.
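The power figures above can be checked directly. The sketch below is illustrative Python (not part of the original chapter) that computes the power of a two-tailed independent-samples t-test from the noncentral t distribution, assuming a large effect size of d = 0.8:

```python
import numpy as np
from scipy import stats

def power_two_sample_t(d, n_per_group, alpha=0.05):
    """Power of a two-tailed independent-samples t-test for effect size d."""
    df = 2 * n_per_group - 2
    ncp = d * np.sqrt(n_per_group / 2)       # noncentrality parameter
    t_crit = stats.t.ppf(1 - alpha / 2, df)  # two-tailed critical value
    # Probability that |t| exceeds the critical value when the true
    # effect size is d, evaluated under the noncentral t distribution
    return (1 - stats.nct.cdf(t_crit, df, ncp)
            + stats.nct.cdf(-t_crit, df, ncp))

# Power for a large effect (d = 0.8) at the sample sizes discussed above
for n in (20, 26, 59):
    print(n, round(power_two_sample_t(0.8, n), 3))
```

With 26 participants per group the function returns roughly 80% power, and with 59 per group roughly 99%, matching the figures in the text; at the 20-per-group size from Oakes' example, power is noticeably lower.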

Criticism of Rigid p-Value Thresholds

Another concern is the strict reliance on the p < 0.05 threshold to determine statistical significance. This rigid boundary often leads to arbitrary distinctions between “significant” and “non-significant” results. For example, two studies with nearly identical findings, one with p = 0.04 and another with p = 0.06, might be judged very differently. The former could be viewed as important and publishable, while the latter might be dismissed. This convention not only stifles valuable research but also exacerbates problems like the file drawer issue, where non-significant results remain unpublished.

Limitations of Null Hypothesis Testing

A deeper criticism questions the fundamental logic of null hypothesis testing. Rejecting the null hypothesis merely indicates that some nonzero relationship exists in the population; it says nothing about the strength or nature of that relationship. Critics regard this lack of precision as uninformative. They also argue that, in many cases, the null hypothesis (e.g., d = 0 or r = 0) is unlikely ever to be strictly true, because any relationship, however minute, will deviate from zero if measured with enough precision. Consequently, rejecting the null hypothesis may reveal little that is not already assumed. It would be akin to a chemist establishing only that temperature affects gas volume without providing an equation that describes the relationship.

Defences of Null Hypothesis Testing

Despite these criticisms, null hypothesis testing has defenders. Robert Abelson (1995) argued that when properly understood and executed, it provides a robust framework for research. Particularly in new areas of study, null hypothesis testing offers a systematic way to demonstrate that results are not merely due to chance, lending credibility to new findings.

The End of p-Values?

In 2015, the editors of Basic and Applied Social Psychology announced a ban on null hypothesis testing and related statistical procedures (Trafimow & Marks, 2015). Authors were still permitted to include p-values in their submissions, but the editors committed to removing them before publication. While they did not suggest an alternative statistical method to replace null hypothesis testing, they emphasised the importance of relying on descriptive statistics and effect sizes instead.

Although this decision has not been widely adopted in the broader research community, it sparked significant discussion. By challenging the long-standing “gold standard” of statistical validity, the editors invited psychologists to critically reconsider how knowledge is established and communicated within the field. This debate continues to influence conversations about the role and limitations of statistical methods in scientific research.

What Should Be Done?

Even supporters of null hypothesis testing acknowledge its flaws, but what can researchers do to address these issues? The APA Publication Manual offers several recommendations to improve the practice.

One suggestion is to accompany every null hypothesis test with an effect size measure, such as Cohen’s d or Pearson’s r. This addition provides an estimate of the strength of the relationship in the population, rather than simply indicating whether a relationship exists. This is important because a p-value alone cannot measure relationship strength, as it is influenced by sample size. For instance, even a very weak relationship can appear statistically significant if the sample size is large enough.
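The point about sample size can be demonstrated directly. The following sketch (illustrative Python with simulated data, not from the chapter) shows a very weak effect, a true difference of only 0.1 standard deviations, reaching statistical significance with large samples, alongside Cohen's d computed as the mean difference divided by the pooled standard deviation:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Two simulated groups whose true means differ by only 0.1 standard
# deviations (a very weak effect), but with 20,000 cases per group
a = rng.normal(loc=0.0, scale=1.0, size=20_000)
b = rng.normal(loc=0.1, scale=1.0, size=20_000)

t_stat, p_value = stats.ttest_ind(a, b)

# Cohen's d: mean difference divided by the pooled standard deviation
pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
cohens_d = (b.mean() - a.mean()) / pooled_sd

# A tiny p-value despite a trivial effect size
print(p_value, cohens_d)
```

Reporting d alongside p makes clear that a highly "significant" result here reflects sample size, not a strong relationship.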

Another recommendation is to use confidence intervals instead of, or alongside, null hypothesis tests. A confidence interval is a range of values constructed so that, across repeated samples, a certain percentage of such intervals (usually 95%) will contain the population parameter. For example, if a sample of 20 students has a mean calorie estimate of 200 for a chocolate chip cookie with a 95% confidence interval of 160 to 240, we can be 95% confident that this interval contains the true population mean, in the sense that 95% of intervals constructed this way would capture it. Confidence intervals are often easier to interpret than null hypothesis tests and still provide the information needed by those who wish to conduct such tests. In the example above, the sample mean of 200 is statistically significantly different at the .05 level from any hypothetical population mean outside the confidence interval, such as 250.
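The duality between the interval and the test can be sketched as follows. This is illustrative Python with simulated data; the calorie figures are hypothetical, loosely echoing the example above:

```python
import numpy as np
from scipy import stats

# Simulated calorie estimates from 20 students (hypothetical values)
rng = np.random.default_rng(0)
estimates = rng.normal(loc=200, scale=85, size=20)

mean = estimates.mean()
sem = stats.sem(estimates)               # standard error of the mean
df = len(estimates) - 1
lo, hi = stats.t.interval(0.95, df, loc=mean, scale=sem)

# A hypothetical population mean at the interval's edge sits exactly at
# the .05 rejection boundary of a two-tailed one-sample t-test, and any
# value outside the interval would be rejected at that level.
print(round(lo, 1), round(hi, 1))
```

Testing the sample against the upper bound `hi` yields a p-value of exactly .05, which is the sense in which the interval contains all the population means the data cannot reject.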

Finally, more radical solutions propose replacing null hypothesis testing altogether. Bayesian statistics is one such alternative. In this approach, researchers assign initial probabilities to the null and alternative hypotheses before conducting a study, then update these probabilities in light of the observed data. While Bayesian methods are gaining attention, it is too soon to determine whether they will become standard in psychological research. For now, null hypothesis testing, supplemented by effect size measures and confidence intervals, remains the predominant method.
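As a toy illustration of this kind of Bayesian updating (a minimal sketch with made-up hypotheses, prior, and data, not a full Bayesian analysis): two point hypotheses about a coin receive equal prior probability, and observing the data shifts that probability via Bayes' theorem.

```python
from math import comb

# Toy example: H0 says the coin is fair (p = .5); H1 says it is biased
# (p = .7). Both hypotheses and the 50/50 prior are assumptions made
# purely for illustration. We then observe 15 heads in 20 flips.
heads, flips = 15, 20
prior_h0, prior_h1 = 0.5, 0.5

# Likelihood of the observed data under each hypothesis (binomial)
lik_h0 = comb(flips, heads) * 0.5**heads * 0.5**(flips - heads)
lik_h1 = comb(flips, heads) * 0.7**heads * 0.3**(flips - heads)

# Bayes' theorem: posterior = prior * likelihood / total evidence
evidence = prior_h0 * lik_h0 + prior_h1 * lik_h1
post_h0 = prior_h0 * lik_h0 / evidence
post_h1 = prior_h1 * lik_h1 / evidence

print(round(post_h0, 3), round(post_h1, 3))
```

Unlike a p-value, the output is a direct statement about the probability of each hypothesis given the data (and the chosen prior), which is the quantity researchers often mistakenly read a p-value as providing.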


References

Abelson, R. P. (1995). Statistics as principled argument. Lawrence Erlbaum Associates.

Oakes, M. W. (1986). Statistical inference: A commentary for the social and behavioural sciences. Wiley.

Trafimow, D., & Marks, M. (2015). Editorial. Basic and Applied Social Psychology, 37(1), 1–2. https://doi.org/10.1080/01973533.2015.1012991

Chapter Attribution 

Content adapted, with editorial changes, from:

Research methods in psychology (4th ed., 2019) by R. S. Jhangiani et al., Kwantlen Polytechnic University, is used under a CC BY-NC-SA licence.

License


11.7. Criticisms of Null Hypothesis Testing Copyright © 2025 by Marc Chao and Muhamad Alif Bin Ibrahim is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, except where otherwise noted.
