How often have you been sitting in a reproductive medicine meeting, listening to the presentation of a new and promising study, only to hear the speaker proudly announce the results: "And the p-value is 0.04!"?
![](https://static.wixstatic.com/media/9ae0b1_527b01f9f28e49e596abf81cf8acd2a5~mv2.jpg/v1/fill/w_486,h_864,al_c,q_85,enc_auto/9ae0b1_527b01f9f28e49e596abf81cf8acd2a5~mv2.jpg)
Usually, the room nods approvingly—until someone raises their hand and asks, "But... what about the sample size?"
With so little adequately powered research to go on, we seem to have developed a case of p-value worship. Focusing on the p-value while ignoring key factors like sample size, effect size, or clinical relevance is an unfortunate but frequent mistake we keep repeating in ART. The allure of a "statistically significant" result can be hard to resist, especially when it slips under that magic threshold of 0.05. But here's the kicker: a low p-value does not, by itself, mean the results are meaningful.
Who came up with this number, anyway? Well, it was Ronald Fisher, described by Wikipedia as a 'genius who single-handedly created the foundations of modern statistical science', who set the p = 0.05 threshold. But how did he come up with this number?
Legend has it that he was fond enough of placing wagers on horse races to know that odds of 1 in 20 (i.e., 0.05) marked the point below which bookies judged an event possible, but not probable. So it comes from the gut feeling of the bookies. A safe bet. But a bet nonetheless. More reckless or more cautious players might favor higher or lower odds.
![](https://static.wixstatic.com/media/9ae0b1_f59d2b27cb0f45f49b4b1dbc8d973224~mv2.webp/v1/fill/w_900,h_600,al_c,q_85,enc_auto/9ae0b1_f59d2b27cb0f45f49b4b1dbc8d973224~mv2.webp)
However, 0.05 does have a couple of things going for it. Fisher's choice of 0.05 as a threshold was somewhat arbitrary but pragmatic: small enough to avoid frequent false positives (Type I errors), but not so stringent as to miss real effects (Type II errors). Though who is to say what 'frequent' means, or where the threshold for 'missing' an effect lies? The cutoff also coincides approximately with 2 standard deviations (1.96, to be precise), the critical value for rejecting the null hypothesis in a two-tailed test on a normal distribution.
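If you want to see where that "roughly 2 standard deviations" figure comes from, a one-liner in Python (using scipy, purely as an illustration) recovers it:

```python
# The two-tailed critical value at alpha = 0.05 on a standard normal
# distribution: the origin of the "roughly 2 standard deviations" rule.
from scipy.stats import norm

alpha = 0.05
critical_value = norm.ppf(1 - alpha / 2)  # cuts off 2.5% in each tail
print(critical_value)  # ~1.96, i.e. roughly 2 standard deviations
```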
Anyway, we have to use something, and apparently, bookies know best.
But… but…. Imagine you're testing a new intervention in a trial with just 20 patients. You manage to squeeze out a p-value of 0.04, and it’s tempting to say, "It works!" But let’s face it: with such a small sample, are you really confident that this result isn’t just a fluke? A low p-value in a small study often means the result could easily flip in the opposite direction with a slightly different group of patients. And don’t even get me started on the potential for false positives.
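If you don't trust the arm-waving, a quick simulation makes the point. The sketch below (all numbers invented for illustration: a true mean difference of 1 oocyte, a standard deviation of 2, and 10 patients per arm) re-runs the same tiny trial a thousand times and watches the p-value bounce around:

```python
# A rough simulation of how unstable p-values are in tiny trials: the same
# modest true effect, re-run many times with 10 patients per arm.
# Effect size, SD, and baseline are made up purely for illustration.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)
n_per_arm, true_effect, sd = 10, 1.0, 2.0  # hypothetical oocyte difference and spread

p_values = []
for _ in range(1000):
    control = rng.normal(5.0, sd, n_per_arm)
    treated = rng.normal(5.0 + true_effect, sd, n_per_arm)
    p_values.append(ttest_ind(treated, control).pvalue)

p_values = np.array(p_values)
print(f"p < 0.05 in {np.mean(p_values < 0.05):.0%} of identical trials")
print(f"p-values range from {p_values.min():.4f} to {p_values.max():.2f}")
```

Run it and you will typically find that only around one trial in five reaches p < 0.05, even though the effect is genuinely there; the rest of the time the very same intervention looks like a dud.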
Not to mention the difference between statistical significance and a clinically meaningful difference! Suppose we test two superovulation regimens, A and B, and measure the number of oocytes obtained with each. We might find a statistically significant difference at p < 0.05, or even p < 0.000005, only to discover that treatment A gives us an average of 5.0004 oocytes per patient, while treatment B gives us an average of 5.0005. Not much to crow about, is it?
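To see how sample size alone can manufacture 'significance', here is a toy calculation (the 0.1-oocyte mean difference and standard deviation of 3 are assumptions chosen purely for illustration): hold a clinically trivial difference fixed and watch the p-value collapse as the trial grows.

```python
# Statistical vs clinical significance: a fixed, clinically irrelevant
# difference in mean oocyte yield becomes "significant" once n is large enough.
import numpy as np
from scipy.stats import norm

mean_a, mean_b, sd = 5.0, 5.1, 3.0  # assumed means and SD, for illustration only
for n_per_arm in (100, 1_000, 10_000, 100_000):
    # two-sample z-test for a difference in means with known SD
    se = sd * np.sqrt(2 / n_per_arm)
    z = (mean_b - mean_a) / se
    p = 2 * norm.sf(abs(z))
    print(f"n = {n_per_arm:>7,} per arm -> p = {p:.4f}")
```

Somewhere past a few thousand patients per arm the p-value dips below 0.05, even though nobody would change practice over a tenth of an oocyte.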
The moral of the story? A p-value is just one tool in the statistical toolbox. By itself, it’s like a single ingredient in a recipe—without the rest (like sample size, confidence intervals, and clinical impact), the result is bland, and possibly even misleading. So, next time you find yourself tempted to celebrate a p-value alone, remember: it’s not the only star of the show.