A crisis has recently been described in the reproducibility of studies reported in leading science journals.1–6 One of the multiple origins of this crisis is that significant statistical results obtained in some studies were not replicated or supported in other studies.7–9 One of the reasons given is that researchers often wrongly generalize their results to the population of interest (ie, the target population) after obtaining a significant result in their study sample.10,11 P-hacking and HARKing (where HARK stands for “hypothesizing after the results are known”) are inappropriate methods of analyzing and interpreting study findings that lead to false-positive results. P-hacking (also called P-fishing) occurs when researchers collect data without a predetermined sample size, select data without a priori identification of inclusion and exclusion criteria, or switch among statistical test approaches until nonsignificant results become significant.12–15 HARKing occurs when researchers present a post hoc hypothesis based on their results as if it were an a priori hypothesis.16 However, even in the absence of P-hacking, HARKing, biases,17 or any other errors in scientific reporting, the probability of wrongly inferring that obtained significant results apply to the target population is high in some situations.
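The inflationary effect of one common form of P-hacking, testing repeatedly as data accumulate and stopping at the first significant result, can be illustrated with a small simulation. This sketch is not from the article; the checkpoint sample sizes, number of simulated studies, and the simple one-sample z-test are illustrative assumptions.

```python
# Illustrative simulation (hypothetical, not from the article): "optional
# stopping," one form of P-hacking, inflates the false-positive rate well
# above the nominal alpha = 0.05 even when the null hypothesis is true.
import random
from math import sqrt
from statistics import NormalDist

random.seed(1)
ALPHA = 0.05
CHECKPOINTS = (10, 20, 30, 40, 50)  # hypothetical interim looks at the data
N_STUDIES = 2000

def p_value(sample):
    # Two-sided one-sample z-test of mean = 0, known SD = 1.
    z = (sum(sample) / len(sample)) * sqrt(len(sample))
    return 2 * (1 - NormalDist().cdf(abs(z)))

fixed_hits = optional_hits = 0
for _ in range(N_STUDIES):
    # Data generated under the null: there is truly no effect.
    data = [random.gauss(0, 1) for _ in range(max(CHECKPOINTS))]
    # Honest design: one test, at the predetermined sample size.
    if p_value(data) < ALPHA:
        fixed_hits += 1
    # P-hacking: test at every interim look, stop at the first "significant" one.
    if any(p_value(data[:n]) < ALPHA for n in CHECKPOINTS):
        optional_hits += 1

print(f"fixed-n false-positive rate:  {fixed_hits / N_STUDIES:.3f}")
print(f"optional-stopping rate:       {optional_hits / N_STUDIES:.3f}")
```

With a fixed sample size the false-positive rate stays near the nominal 5%, whereas repeated interim testing roughly doubles or triples it, despite every simulated dataset containing no real effect.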
The purpose of this article is to help researchers in clinical veterinary science, and clinicians who interpret research findings, appreciate the degree of certainty or uncertainty involved in generalizing study results to the target population of studied animals, according to the characteristics of the clinical study. This appreciation is also important in the practice of evidence-based veterinary medicine,18 particularly when critically appraising the evidence within the more general framework of the clinical decision-making process.19,20 To this end, 2 examples of hypothetical studies (each assumed, for simplicity, to be feasible and ethical) will be used. Statistical and diagnostic test concepts will be reviewed to demonstrate the similarity between diagnostic test interpretation and statistical test interpretation and to illustrate the calculation of the probability of an incorrect conclusion after a significant association has been obtained.
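The probability of an incorrect conclusion after a significant result, the false-positive report probability (FPRP) of Wacholder et al (reference 35), can be sketched directly. In the diagnostic-test analogy, the pre-study probability that the hypothesis is true plays the role of disease prevalence, statistical power plays the role of sensitivity, and alpha plays the role of the false-positive rate; FPRP is then 1 minus the positive predictive value. The prior probabilities below are hypothetical values chosen only for illustration.

```python
# Sketch of the false-positive report probability (FPRP) of Wacholder et
# al (reference 35): the probability that a statistically significant
# association is actually false, analogous to 1 - positive predictive
# value of a diagnostic test.
def fprp(prior, alpha=0.05, power=0.80):
    """Probability that a significant result is a false positive.

    prior : pre-study probability that the tested hypothesis is true
            (the analogue of disease prevalence)
    alpha : significance threshold (the analogue of 1 - specificity)
    power : probability of detecting a true association (the analogue
            of sensitivity)
    """
    false_pos = alpha * (1 - prior)  # null hypotheses that test significant
    true_pos = power * prior         # true hypotheses that test significant
    return false_pos / (false_pos + true_pos)

# Hypothetical pre-study probabilities: even with a conventional alpha
# and good power, a significant result obtained for a long-shot
# hypothesis is more often wrong than right.
for prior in (0.5, 0.1, 0.01):
    print(f"prior = {prior:>4}: FPRP = {fprp(prior):.2f}")
```

For example, with alpha = 0.05 and power = 0.80, a hypothesis with a pre-study probability of 10% yields an FPRP of 0.36, and one with a pre-study probability of 1% yields an FPRP of about 0.86: a "significant" result alone says little without the prior plausibility of the hypothesis.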
The author thanks Professor Fanny Storck and Dr. Elodie Darnis for their helpful comments and Dr. Jeremy Beguin for assistance in finding illustrative examples in the field of oncology.
9. Begley CG, Ioannidis JP. Reproducibility in science: improving the standard for basic and preclinical research. Circ Res 2015;116:116–126.
12. Greenland S. Multiple comparisons and association selection in general epidemiology. Int J Epidemiol 2008;37:430–434.
13. Guller U, DeLong ER. Interpreting statistics in medical literature: a vade mecum for surgeons. J Am Coll Surg 2004;198:441–458.
14. Head ML, Holman L, Lanfear R, et al. The extent and consequences of p-hacking in science. PLoS Biol 2015;13:e1002106.
19. Vandeweerd JM, Kirschvink N, Clegg P, et al. Is evidence-based medicine so evident in veterinary research and practice? History, obstacles and perspectives. Vet J 2012;191:28–34.
20. White BJ, Larson RL. Systematic evaluation of scientific research for clinical relevance and control of bias to improve clinical decision making. J Am Vet Med Assoc 2015;247:496–500.
22. West CP, Dupras DM. 5 ways statistics can fool you. Tips for practicing clinicians. Vaccine 2013;31:1550–1552.
23. Mullin CM, Arkans MA, Sammarco CD, et al. Doxorubicin chemotherapy for presumptive cardiac hemangiosarcoma in dogs. Vet Comp Oncol 2016;14:e171–e183.
24. Holtermann N, Kiupel M, Kessler M, et al. Masitinib monotherapy in canine epitheliotropic lymphoma. Vet Comp Oncol 2016;14(suppl 1):127–135.
25. Lehmann EL. The Fisher, Neyman-Pearson theories of testing hypotheses: one theory or two? J Am Stat Assoc 1993;88:1242–1249.
27. Jeffery N. Liberating the (data) population from subjugation to the 5% (P-value). J Small Anim Pract 2015;56:483–484.
29. Amrhein V, Greenland S, McShane B. Scientists rise up against statistical significance. Nature 2019;567:305–307.
31. White BJ, Larson RL, Theurer ME. Interpreting statistics from published research to answer clinical and management questions. J Anim Sci 2016;94:4959–4971.
32. Browner WS, Newman TB. Are all significant P values created equal? The analogy between diagnostic tests and clinical research. JAMA 1987;257:2459–2463.
33. Greenland S. Bayesian perspectives for epidemiological research: I. Foundations and basic methods. Int J Epidemiol 2006;35:765–775.
34. Lash TL. The harm done to reproducibility by the culture of null hypothesis significance testing. Am J Epidemiol 2017;186:627–635.
35. Wacholder S, Chanock S, Garcia-Closas M, et al. Assessing the probability that a positive report is false: an approach for molecular epidemiology studies. J Natl Cancer Inst 2004;96:434–442.
36. Held L. Reverse-Bayes analysis of two common misinterpretations of significance tests. Clin Trials 2013;10:236–242.
37. Goodman SN. P values, hypothesis tests, and likelihood: implications for epidemiology of a neglected historical debate. Am J Epidemiol 1993;137:485–496, discussion 497–501.
38. Gliner JA, Leech NL, Morgan GA. Problems with null hypothesis significance testing (NHST): what do the textbooks say? J Exp Educ 2002;71:83–92.
39. Greenland S, Senn SJ, Rothman KJ, et al. Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. Eur J Epidemiol 2016;31:337–350.
41. Wasserstein RL, Lazar NA. The ASA's statement on p-values: context, process, and purpose. Am Stat 2016;70:129–133.
44. Trafimow D, Amrhein V, Areshenkoff CN, et al. Manipulating the alpha level cannot cure significance testing. Front Psychol 2018;9:699.
47. Colquhoun D. The reproducibility of research and the misinterpretation of p-values (Erratum published in R Soc Open Sci 2018;5:180100). R Soc Open Sci 2017;4:171085.