Someone who had done a candidate gene study which uncovered no evidence for association asked me whether he should perform a power calculation. Yes, candidate gene studies have been widely and justifiably criticised, mostly because of small sample sizes and over-interpretation of results, but in this particular case, a candidate gene study wasn’t so bad – it’s a great biological candidate, impossible to genotype using GWAS chips, and he had a sample size close to an order of magnitude larger than previous studies. But, a post-hoc power calculation? I may have had a slightly over-dramatic reaction.

Power calculations are great. I really like biologists who want to do power calculations without me having to prod them with pointy sticks. But the appropriate time for a power calculation is when a study is designed. They address the question “how big a sample do I need to have a good chance of detecting an effect of a size I believe may exist?” or, alternatively, “if I can collect this many samples, do I have a good chance of detecting an effect of a size I believe may exist?”
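To make the design-time question concrete, here is a minimal sketch of a prospective sample-size calculation for a two-group comparison, using the standard normal approximation. The function name and the effect size of d = 0.2 are my own illustrative choices, not anything from a specific study:

```python
from math import ceil
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sample comparison of
    means (normal approximation, two-sided test at significance alpha)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # e.g. 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # e.g. 0.84 for 80% power
    return ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)

# "How big a sample do I need to have a good chance of detecting
# an effect of a size I believe may exist?"
print(n_per_group(0.2))  # 393 per group for a standardised effect of d = 0.2
```

This is the calculation to run *before* collecting data: if the answer dwarfs the samples you can realistically access, the study needs rethinking, not running.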

If the power is low, then the proposed study is unlikely to reveal anything useful, and effort needs to be put into accessing more samples. But once a study has been completed and analysed, retrospective, post-hoc power calculations should not be done. Ever. ^{1}

There are two kinds of post-hoc power calculation for null studies. One uses the effect size estimated from the data to calculate the “observed power”. As the observed power has a one-to-one relationship with the p value of the study, it should be clear that it can add no information to any analysis. The other kind asks “what is the smallest effect size I had power to detect with the sample I collected?”. Ignoring the fact that this question should have been asked before the study began, it is now meaningless. If you didn’t find association, does that mean you can rule out an effect of that size? Of course not! If you estimated the effect size associated with 80% power, as is common, your study could well be among the unlucky 20%; how could you tell?
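The one-to-one relationship is easy to demonstrate for a two-sided z-test: the “observed power” can be computed from the p value alone, with no reference to the data at all, which is precisely why it adds nothing. A sketch:

```python
from statistics import NormalDist

norm = NormalDist()

def observed_power(p, alpha=0.05):
    """'Observed power' of a two-sided z-test, computed from nothing but
    the p value itself -- a deterministic function of p, hence uninformative."""
    z_obs = norm.inv_cdf(1 - p / 2)       # the |z| statistic implied by p
    z_crit = norm.inv_cdf(1 - alpha / 2)  # critical value, e.g. 1.96
    return norm.cdf(z_obs - z_crit) + norm.cdf(-z_obs - z_crit)

# A study that lands exactly on p = 0.05 always has "observed power"
# of almost exactly 50%, regardless of what was actually measured:
print(round(observed_power(0.05), 3))  # 0.5
```

Larger p values map to lower observed power and vice versa, so reporting observed power alongside the p value is simply restating the p value on a different scale.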

Instead, the data that have been so carefully collected should be used to infer which underlying effect sizes can be declared unlikely. The confidence interval is a good place to start. Its definition can seem a little convoluted, as it is based on the frequentist notion of repeated sampling: if you repeated the experiment 100 times and constructed a 95% confidence interval in the same manner each time, you could expect the interval to include the true value of the parameter in the population in about 95 of those 100 experiments. Since any particular study is more likely to be among the 95% than the 5%, it is reasonable to conclude that the true value of the parameter is unlikely to lie outside the estimated confidence interval. If 5% seems too big, you could always construct a 99% confidence interval.
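As an illustration, here is a Wald confidence interval for an odds ratio from a 2×2 table, the kind of summary a case-control candidate gene study produces. The counts below are entirely made up for the example:

```python
from math import exp, log, sqrt
from statistics import NormalDist

def odds_ratio_ci(a, b, c, d, level=0.95):
    """Wald confidence interval for the odds ratio of a 2x2 table:
    a, b = carriers / non-carriers among cases; c, d = same among controls."""
    log_or = log((a * d) / (b * c))
    se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # standard error of log OR
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return exp(log_or - z * se), exp(log_or + z * se)

# Invented counts for a null result: 120/880 in cases, 100/900 in controls
lo, hi = odds_ratio_ci(120, 880, 100, 900)
print(round(lo, 2), round(hi, 2))  # 0.93 1.63
```

The interval straddles an odds ratio of 1, so there is no evidence for association; but it also says that odds ratios above about 1.6 are unlikely given these data, which is the useful retrospective statement a post-hoc power calculation cannot give you.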

## Alternatively…

Reading around a little for this post, I found one absolutely fantastic reference ^{2}. This blog post could have just said “Read Hoenig and Heisey to understand why post-hoc power calculations shouldn’t be performed”.

## Footnotes:

^{1} NB I don’t mean that power calculations addressing the sample size needed to replicate the study (detect association again in an independent sample) should be avoided, as they are clearly prospective. I mean it is meaningless to ask “what is the power of this study I have just performed”.

^{2} John M. Hoenig and Dennis M. Heisey (2001). The Abuse of Power: The Pervasive Fallacy of Power Calculations for Data Analysis. *The American Statistician* 55(1):19–24.