What finally convinced me was thinking about how so many studies include p-values as a way of proving one drug is "significantly" better than another or placebo, and don't bother to include effect sizes. This is a great obfuscating tactic, so whoever's conducting these studies must think that people misunderstand p-values. And that makes it worth writing this post.

This is the story of how I clarified p-values to myself.

Let's say we're doing a study comparing two compounds, Hubba Bubba and Bubble Yum, to determine which is better at curing the common cold.

We start out with the null hypothesis, which is the assumption that there is no difference between the two compounds, or what they can do. If the p-value turns out to be less than 0.05, then we can reject the null hypothesis. I don't like thinking about the null hypothesis because it confuses me. It's like trying to decipher a triple negative. So we're gonna put it aside for now.

We randomize 100 patients to each arm, and follow up the next day, and the day after, with a rating scale, the CQ-7. And this is what we find:

Let's assume we've done all our work honestly and accurately, and we get a p-value less than 0.05. Does this mean that Bubble Yum is significantly better at curing the common cold than Hubba Bubba? It does not. It means we can reject the null hypothesis. But what does THAT mean?

Think of it this way.

Suppose that on the night before the study begins, I sneak into the lab and change the wrappers so that there is no Bubble Yum, only Hubba Bubba. And then suppose we do the study, and we get exactly the same results as above. Can it be? Is it possible that all 100 subjects taking Hubba Bubba wrapped as Bubble Yum got better, and all 100 subjects taking Hubba Bubba wrapped as Hubba Bubba didn't? Yes, it is possible. It's just extremely unlikely. Extremely **improbable**. How improbable? Well, if the two compounds really were exactly the same, there would be less than a 5% chance of their yielding results this freakishly different, or more so. That's why the "p" in p-value stands for probability.

In other words, we've rejected the null hypothesis.
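If you'd like to see the wrapper-swapping trick run on a computer, here is a rough Python sketch of it, usually called a permutation test. The recovery counts (60 of 100 versus 40 of 100) are made up for illustration and are not the study's results, and `permutation_p_value` is just a name I picked. It shuffles the wrappers many times and counts how often pure chance produces a gap at least as big as the observed one.

```python
import random

def permutation_p_value(group_a, group_b, n_shuffles=10_000, seed=0):
    """Estimate how often randomly relabelled data shows a gap
    at least as large as the one actually observed."""
    rng = random.Random(seed)
    n_a = len(group_a)
    observed_gap = abs(sum(group_a) / n_a - sum(group_b) / len(group_b))

    pooled = list(group_a) + list(group_b)
    at_least_as_extreme = 0
    for _ in range(n_shuffles):
        # Swapping the wrappers: the outcomes stay, the labels are random.
        rng.shuffle(pooled)
        fake_a, fake_b = pooled[:n_a], pooled[n_a:]
        gap = abs(sum(fake_a) / len(fake_a) - sum(fake_b) / len(fake_b))
        # Small tolerance so floating-point noise doesn't hide exact ties.
        if gap >= observed_gap - 1e-12:
            at_least_as_extreme += 1

    # Add-one correction so the estimate is never exactly zero.
    return (at_least_as_extreme + 1) / (n_shuffles + 1)

# Hypothetical outcomes (1 = cold gone, 0 = still sick); NOT the study's data.
bubble_yum = [1] * 60 + [0] * 40
hubba_bubba = [1] * 40 + [0] * 60

p = permutation_p_value(bubble_yum, hubba_bubba)
print(f"estimated p-value: {p:.4f}")
```

A small number here says that if the wrappers meant nothing, a gap this big would rarely show up by chance.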

Let me repeat. If the p-value is less than 0.05, then, if the null hypothesis were true (i.e. if the compounds were the same), there would be less than a 5% chance of their yielding results this disparate. That is not quite the same as saying there's less than a 5% chance that the null hypothesis is true, but it's taken as evidence that the compounds are not the same. The 0.05 cutoff is called the significance level, and it's a convention: we could just as easily choose 0.10, or 0.01.
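Here is one way to check what the cutoff buys you, as a rough Python sketch under assumptions I've made up: both arms get the same compound with a 50% recovery rate, 100 patients per arm, and the p-value comes from a simple two-proportion z-test (a normal approximation, not necessarily what a real study would use). The function names are my own. It runs many such "nothing is going on" studies and counts how often each cutoff raises a false alarm anyway.

```python
import random
from math import sqrt
from statistics import NormalDist

def two_proportion_p(cured_a, cured_b, n):
    """Two-sided p-value from a two-proportion z-test, equal arm sizes."""
    pooled = (cured_a + cured_b) / (2 * n)
    se = sqrt(pooled * (1 - pooled) * 2 / n)
    if se == 0:
        return 1.0  # everyone (or no one) recovered in both arms: no gap
    z = (cured_a - cured_b) / n / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def simulate_null_p_values(n_studies=2000, n=100, cure_rate=0.5, seed=1):
    """Run studies where both arms secretly get the same compound."""
    rng = random.Random(seed)
    p_values = []
    for _ in range(n_studies):
        cured_a = sum(rng.random() < cure_rate for _ in range(n))
        cured_b = sum(rng.random() < cure_rate for _ in range(n))
        p_values.append(two_proportion_p(cured_a, cured_b, n))
    return p_values

p_values = simulate_null_p_values()
false_alarm_rate = {
    alpha: sum(p < alpha for p in p_values) / len(p_values)
    for alpha in (0.10, 0.05, 0.01)
}
for alpha, rate in false_alarm_rate.items():
    print(f"cutoff {alpha:.2f}: false alarms in {rate:.1%} of studies")
```

Each false alarm rate should land somewhere near its cutoff, which is the trade-off you make when you pick 0.10 or 0.01 instead of 0.05.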

So a very small p-value does not mean that Bubble Yum is significantly better than Hubba Bubba at curing the common cold. It just means that results this different would be extremely unlikely if Bubble Yum were no better than Hubba Bubba at curing the common cold.

In order to determine how much better Bubble Yum is than Hubba Bubba, you need to look at effect size, and as we have seen any number of times, a small p-value does not imply a large effect size. For example, in the CBT study I recently looked at, p was <0.001, but the effect size was 0.45, only moderate.
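To see why the two can come apart, here is a small Python sketch. It assumes two equal-sized arms and uses a z-test approximation, in which the test statistic is the standardized effect size (Cohen's d) times the square root of half the arm size. The effect size of 0.1 and the sample sizes are hypothetical numbers I chose, and `two_sided_p` is my own helper name.

```python
from math import sqrt
from statistics import NormalDist

def two_sided_p(effect_size_d, n_per_arm):
    """Approximate two-sided p-value for a given standardized effect
    size and equal arm sizes (normal approximation)."""
    z = effect_size_d * sqrt(n_per_arm / 2)
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Same tiny effect every time; only the number of patients changes.
small_effect = 0.1
for n_per_arm in (50, 500, 5000):
    p = two_sided_p(small_effect, n_per_arm)
    print(f"d = {small_effect}, n per arm = {n_per_arm}: p = {p:.2g}")
```

The effect never changes, but the p-value shrinks as the study grows, so a tiny p-value by itself tells you more about the sample size than about how much better the drug is.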

This is why many studies leave the effect size out of their publications.