Statistical Review in Peer Review: Common Errors and Red Flags
Statistical validity is one of the most frequently compromised—and most frequently overlooked—dimensions of scientific peer review. Reviewers trained primarily in a biological or clinical discipline may lack the quantitative background to catch fundamental errors in study design, data analysis, or reporting. The result is that statistically flawed research passes through peer review and enters the scientific record, where it shapes clinical practice, policy, and future research. Understanding where these failures occur, why they persist, and what rigorous statistical review looks like is essential for anyone engaged in the review process.
Why Statistical Errors Persist in Published Research
The peer review system was not originally designed with formal statistical oversight in mind. When the Royal Society established early review practices in the seventeenth century—a history traced in detail on the history of peer review page—the empirical sciences had not yet developed the probabilistic frameworks that now underlie virtually all experimental research. Today, most journals assign manuscripts to subject-matter experts who may have no formal training in biostatistics.
A 2015 analysis published in PLOS ONE examined papers in medical journals and found that approximately half contained at least one statistical error significant enough to affect the study's conclusions. A separate audit published in BMJ estimated that up to 40% of papers in high-impact clinical journals contained errors in the application of basic inferential statistics. These are not anomalies. They reflect a structural gap between what peer review is asked to evaluate and who is typically asked to evaluate it.
The International Committee of Medical Journal Editors (ICMJE), whose Recommendations for the Conduct, Reporting, Editing, and Publication of Scholarly Work in Medical Journals constitute the closest thing to a universal standard in biomedical publishing, explicitly states that editors should seek statistical review for papers where quantitative methods are central to the conclusions. Many journals claim compliance with ICMJE guidance but do not consistently apply this recommendation.
The Most Common Statistical Errors Reviewers Should Catch
Several categories of error appear repeatedly across the literature. Reviewers—and authors—should treat these as primary checkpoints.
Misuse of p-values and significance thresholds. The p-value remains widely misinterpreted. A p-value below 0.05 does not indicate that a result is clinically meaningful, that a finding is likely to replicate, or that the null hypothesis is false. It indicates only that, under the null hypothesis, data at least as extreme as those observed would occur less than 5% of the time. The American Statistical Association (ASA) issued a formal statement in 2016 explicitly warning against treating p < 0.05 as a binary threshold for scientific truth. Reviewers should look for effect sizes, confidence intervals, and contextual interpretation alongside any significance test.
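As a minimal illustration, the Python sketch below reports an effect size (Cohen's d) and a 95% confidence interval alongside the p-value from a two-sample t-test. The data are simulated and the group means, spreads, and sample sizes are arbitrary assumptions, not values from any particular study.

```python
# Sketch: report an effect size and confidence interval alongside a p-value.
# Data are simulated for illustration; in practice these would be study arms.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
control = rng.normal(loc=10.0, scale=2.0, size=40)
treated = rng.normal(loc=11.0, scale=2.0, size=40)

t_stat, p_value = stats.ttest_ind(treated, control)

# Cohen's d from the pooled standard deviation.
n1, n2 = len(treated), len(control)
pooled_sd = np.sqrt(((n1 - 1) * treated.var(ddof=1) +
                     (n2 - 1) * control.var(ddof=1)) / (n1 + n2 - 2))
cohens_d = (treated.mean() - control.mean()) / pooled_sd

# 95% CI for the difference in means (pooled-variance t interval).
diff = treated.mean() - control.mean()
se = pooled_sd * np.sqrt(1 / n1 + 1 / n2)
t_crit = stats.t.ppf(0.975, df=n1 + n2 - 2)
ci_low, ci_high = diff - t_crit * se, diff + t_crit * se

print(f"p = {p_value:.3f}, d = {cohens_d:.2f}, "
      f"95% CI for difference = ({ci_low:.2f}, {ci_high:.2f})")
```

A results section reporting all three quantities lets a reader judge magnitude and precision, not just whether an arbitrary threshold was crossed.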
Inadequate sample size and underpowered studies. A study may be correctly analyzed and still produce misleading results if it lacks sufficient statistical power to detect the effect of interest. Underpowered studies are prone to both false negatives (missing real effects) and, counterintuitively, inflated effect size estimates when positive results do emerge. Reviewers should expect a power calculation to be reported in the methods section. If none is present, the absence itself warrants comment.
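The sketch below shows what such an a priori calculation might look like, assuming the statsmodels library is available. The hypothesized effect size of d = 0.5 is an illustrative placeholder; in a real study it should come from pilot data or prior literature, and reviewers should check that it is justified rather than chosen to make the required sample size convenient.

```python
# Sketch: a priori sample-size calculation for a two-arm comparison.
# The hypothesized effect size (Cohen's d = 0.5) is an illustrative assumption.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.5,          # hypothesized Cohen's d
                                   alpha=0.05,               # significance level
                                   power=0.80,               # desired power
                                   alternative='two-sided')
print(f"Required sample size per group: {n_per_group:.0f}")  # ~64 per group
```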
Multiple comparisons without correction. When researchers test multiple hypotheses on a single dataset—whether across multiple endpoints, subgroups, or time points—the probability of a false positive increases substantially. A study testing twenty independent hypotheses at p < 0.05 expects one false positive by chance alone (20 × 0.05 = 1), and the probability of at least one false positive across the family is 1 − 0.95^20, roughly 64%. Reviewers should ask whether corrections such as Bonferroni adjustment, Benjamini-Hochberg false discovery rate control, or pre-registered hypothesis selection were applied and justified.
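A short sketch of the two corrections named above, assuming statsmodels is available; the p-values are invented for illustration. Bonferroni controls the family-wise error rate and is conservative; Benjamini-Hochberg controls the false discovery rate and retains more power when many hypotheses are tested.

```python
# Sketch: applying standard multiple-comparison corrections to a set of
# p-values. The p-values here are made up for illustration.
from statsmodels.stats.multitest import multipletests

p_values = [0.001, 0.008, 0.020, 0.041, 0.049, 0.120, 0.350, 0.610]

for method in ('bonferroni', 'fdr_bh'):  # family-wise vs. false discovery rate
    reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method=method)
    print(method, [f"{p:.3f}" for p in p_adjusted], reject.tolist())
```

Note that several raw p-values below 0.05 no longer survive either correction, which is exactly the pattern a reviewer should probe when a manuscript reports many uncorrected endpoints.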
Inappropriate choice of statistical test. Applying a parametric test to non-normally distributed data, using repeated-measures designs without accounting for correlation, or treating ordinal data as continuous are common technical errors. Each can distort results in ways that are not obvious from reading a results section. Reviewers without a strong quantitative background should flag uncertainty rather than assume the analysis is correct.
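The sketch below illustrates one defensible workflow, checking a normality assumption before selecting between a parametric and a nonparametric test, using simulated skewed data. The 0.05 threshold and the fallback to a Mann-Whitney U test are illustrative choices; a pre-registered analysis plan should fix the test in advance rather than select it after seeing the data.

```python
# Sketch: checking a distributional assumption before choosing between a
# parametric and a nonparametric test. Data are simulated and skewed.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group_a = rng.lognormal(mean=0.0, sigma=0.8, size=30)  # non-normal by design
group_b = rng.lognormal(mean=0.3, sigma=0.8, size=30)

# Shapiro-Wilk tests the normality assumption behind the t-test.
normal_a = stats.shapiro(group_a).pvalue > 0.05
normal_b = stats.shapiro(group_b).pvalue > 0.05

if normal_a and normal_b:
    stat, p = stats.ttest_ind(group_a, group_b)
    test_name = "independent-samples t-test"
else:
    stat, p = stats.mannwhitneyu(group_a, group_b, alternative='two-sided')
    test_name = "Mann-Whitney U test"

print(f"{test_name}: statistic = {stat:.2f}, p = {p:.3f}")
```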
Selective reporting and outcome switching. This overlaps with peer review ethics concerns and is discussed in that context elsewhere. From a purely statistical standpoint, the practice of reporting only significant outcomes from a broader analytic plan—sometimes called p-hacking or data dredging—produces a literature that systematically overstates effect sizes and replication rates.
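The inflation is straightforward to demonstrate by simulation. The sketch below, with invented parameters, runs thousands of underpowered two-arm studies with a true effect of d = 0.3 and "publishes" only the significant positive results; the surviving estimates average well above the true effect.

```python
# Sketch: simulating how reporting only "significant" results inflates effect
# sizes. Each run is an underpowered two-arm study; parameters are illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
true_d, n_per_arm, n_studies = 0.3, 20, 5000
significant_effects = []

for _ in range(n_studies):
    control = rng.normal(0.0, 1.0, n_per_arm)
    treated = rng.normal(true_d, 1.0, n_per_arm)
    t_stat, p = stats.ttest_ind(treated, control)
    if p < 0.05 and t_stat > 0:  # "publish" only positive significant results
        pooled_sd = np.sqrt((treated.var(ddof=1) + control.var(ddof=1)) / 2)
        significant_effects.append((treated.mean() - control.mean()) / pooled_sd)

print(f"True effect: d = {true_d}")
print(f"Mean reported effect among significant studies: "
      f"d = {np.mean(significant_effects):.2f}")  # substantially above 0.3
```

With twenty participants per arm, only samples whose observed effect is roughly double the true one cross the significance threshold, so a literature filtered on significance systematically overstates the effect.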
What Formal Statistical Review Actually Involves
Journals that take statistical rigor seriously often employ dedicated statistical reviewers or require that manuscripts pass through a biostatistician before a final acceptance decision is made. The BMJ has maintained a statistical review process for decades and publishes its statistical review criteria publicly. Nature and Science have both updated their statistical reporting requirements in response to reproducibility concerns, requiring that raw data, analysis code, and methods descriptions meet higher standards of transparency.
A qualified statistical reviewer will examine the correspondence between the stated hypotheses, the analysis plan, and the reported results. They will verify that assumptions underlying each statistical test are documented and addressed. They will assess whether the figures and tables accurately represent the underlying data and whether summary statistics are reported with appropriate precision and context.
For manuscripts in the life sciences specifically, the EQUATOR Network—a collaborative body that maintains reporting guidelines for health research—provides standards such as CONSORT (for randomized trials), STROBE (for observational studies), and ARRIVE (for animal research). These checklists include explicit statistical reporting requirements and give reviewers a structured framework for evaluation.
Red Flags That Should Prompt Deeper Scrutiny
Some patterns in a manuscript warrant elevated skepticism even before a detailed statistical review is conducted. Results that are uniformly statistically significant across multiple endpoints are unusual in most biological systems and may indicate selective reporting. Effect sizes that are implausibly large relative to the existing literature deserve explanation. Absence of variance reporting—presenting only means or medians without standard deviations or ranges—prevents the reader from assessing distributional assumptions or replication feasibility.
Reviewers should also attend to whether the statistical methods described in the methods section match those reported in the results. Discrepancies sometimes reflect honest drafting errors; they can also indicate post-hoc selection of the analysis that produced the most favorable outcome. The peer review ethics framework addresses what to do when a reviewer suspects intentional manipulation.
For readers evaluating whether a published paper's conclusions are sound, the peer review metrics page provides useful context for understanding how impact factor and citation counts can mask poor statistical quality in published work.
Seeking Qualified Statistical Guidance
Reviewers who encounter manuscripts beyond their statistical competence have both a professional obligation and a practical option: they can request that the editor assign a co-reviewer with statistical expertise, or they can decline the review and note the reason. This is not a failure of responsibility—it is the responsible course. Accepting a review and offering uninformed statistical approval is significantly worse than acknowledging a limitation.
Authors uncertain about their own analysis before submission should consult a biostatistician during the study design phase, not after data collection. Many universities and research institutions have statistical consulting services. The ASA maintains a directory of accredited statisticians, and the Royal Statistical Society offers guidance on finding qualified practitioners in the United Kingdom.
Understanding what peer review is structurally designed to catch—and what it frequently misses—is fundamental to interpreting the scientific literature with appropriate skepticism. For a broader orientation to how the review process works before statistical scrutiny even begins, see the how it works and types of peer review pages.
External references:
- International Committee of Medical Journal Editors (ICMJE): icmje.org
- American Statistical Association (ASA) Statement on p-Values (2016): amstat.org
- EQUATOR Network Reporting Guidelines: equator-network.org