If we wanted to know which taxa are associated with a certain variable in our dataset, we could conduct a test for each taxon, one by one, and check the p-value. All taxa with low p-values (<0.05) we would then call significantly associated with our variable. The problem is that the p-value becomes misleading when conducting multiple statistical tests. If the Null Hypothesis (H0) is true, the p-value is expected to fall anywhere between 0 and 1 (a uniform distribution). As such, if enough statistical tests are conducted, some (~5%) of the p-values will, by chance, be lower than 0.05, even if there are no true associations in the data.
Therefore, when conducting multiple tests it is best practice to adjust the p-values.
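To make this concrete, here is a minimal sketch of the naive taxon-by-taxon approach, assuming a hypothetical abundance matrix with 50 taxa measured in 30 samples and a made-up two-level group variable (all names and numbers here are purely illustrative):
# Hypothetical data: 50 taxa (rows) x 30 samples (columns), with no true group differences
set.seed(1)
abund <- matrix(rnorm(50 * 30), nrow = 50)
group <- rep(c("A", "B"), each = 15)
# One t-test per taxon, collecting the raw p-values
pvals <- apply(abund, 1, function(taxon) t.test(taxon ~ group)$p.value)
# Even with no true associations, some taxa look "significant" at 0.05
sum(pvals < 0.05)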
Getting in-depth:
Let's simulate some random data and see what the p-values look like.
# We use the rnorm() function to generate random normally distributed data with a mean of 0 and standard deviation of 1
# Here is an example with 20 random (normally distributed) numbers:
rnorm(20)
If we do a t-test comparing two randomly created sets of numbers, they should not be significantly different (at least 95% of the time).
# Set seed (for reproducibility)
set.seed(42)
t.test(rnorm(20), rnorm(20))
As expected, not significant. However, let's do this 10000 times and plot a histogram of the p-values:
# Set seed (for reproducibility)
set.seed(42)
# 10000 t-tests with random data
random_ttests <- sapply(1:10000, function(x) t.test(rnorm(20), rnorm(20))$p.value)
# Histogram
hist(random_ttests, breaks = 20)
Let's calculate the fraction of p-values that are at or below 0.05:
sum(random_ttests <= 0.05)/length(random_ttests)
Around 5% of the p-values are below 0.05, just as expected under the Null Hypothesis.
The most common way to adjust p-values for this problem in the microbiome field is to control the False Discovery Rate (FDR) with the method developed by Benjamini and Hochberg (BH); see the details of this method in the paper linked above. Assuming a cutoff of 0.05 for assigning statistical significance, with the original p-values we accept a 5% chance of falsely calling any single test significant even though it is not. With the FDR correction we instead accept that 5% of all the tests we call significant are not truly significant.
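To see what the BH procedure actually does, here is a small worked example on five hypothetical p-values (sorted from smallest to largest); the manual calculation should match R's built-in p.adjust():
# Five hypothetical p-values, sorted in ascending order
p <- c(0.001, 0.012, 0.030, 0.040, 0.800)
m <- length(p)
# BH: multiply each p-value by m divided by its rank, then make the sequence
# non-decreasing going from the largest p-value down, and cap at 1
pmin(1, rev(cummin(rev(p * m / seq_len(m)))))
# Same result with the built-in function
p.adjust(p, method = "BH")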
Applied to our 10000 simulated p-values:
adjusted_pvals <- p.adjust(random_ttests, method = "BH")
# Histogram
hist(adjusted_pvals)
All adjusted p-values are close to 1 (not significant). Now let's see what happens when we add a single test with a truly significant difference to the 10000 random ones:
# Set seed (for reproducibility)
set.seed(4)
# Low p-value
ttest_sig <- t.test(rnorm(20, mean = 0), rnorm(20, mean = 2))$p.value
# Append to the 10000 p-values
random_ttests_plus1 <- c(random_ttests, ttest_sig)
# Adjust p-values
adjusted_pvals_plus1 <- p.adjust(random_ttests_plus1, method = "BH")
# Print
print("Original p-value:")
ttest_sig
print("FDR-adjusted p-value:")
adjusted_pvals_plus1[length(adjusted_pvals_plus1)]
The adjusted p-value is higher than the original, but still significant.
A simpler correction than BH is the Bonferroni correction. It is rarely used in the microbiome field, as it is much stricter and therefore less powerful, but it is common in genome-wide association studies (GWAS).
FWER is the abbreviation for the Familywise Error Rate: the rate at which we find at least one false positive among a series of hypothesis tests. The Bonferroni correction controls the FWER by simply multiplying each p-value by the number of tests conducted (capping the result at 1).
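As a quick check, here is the same small set of hypothetical p-values from before, adjusted by hand and with p.adjust():
# Bonferroni by hand: multiply by the number of tests and cap at 1
p <- c(0.001, 0.012, 0.030, 0.040, 0.800)
pmin(1, p * length(p))
# Same result with the built-in function
p.adjust(p, method = "bonferroni")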
Applied to the same 10001 p-values as before:
# Adjust p-values
adjusted_pvals_plus1_FWER <- p.adjust(random_ttests_plus1, method = "bonferroni")
print("FWER-adjusted p-value:")
adjusted_pvals_plus1_FWER[length(adjusted_pvals_plus1_FWER)]
In this rare case, with only one highly significant test among a series of random tests, the FDR and FWER corrections result in the same adjusted p-value (here the smallest p-value ends up multiplied by the total number of tests under both methods), but in most real-world cases the FWER correction will result in higher adjusted p-values.
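As a sketch of that more typical situation (the effect size, number of true effects, and seed below are arbitrary choices for illustration), we can mix several truly different comparisons in with the random ones and compare the two corrections:
# Reuse the 10000 random p-values and add 50 tests with a true mean difference
set.seed(7)
true_effects <- sapply(1:50, function(x) t.test(rnorm(20, mean = 0), rnorm(20, mean = 2))$p.value)
mixed_pvals <- c(random_ttests, true_effects)
# Adjust with both methods
mixed_FDR <- p.adjust(mixed_pvals, method = "BH")
mixed_FWER <- p.adjust(mixed_pvals, method = "bonferroni")
# Bonferroni-adjusted p-values are never smaller than the BH-adjusted ones
all(mixed_FWER >= mixed_FDR)
# Number of tests called significant at 0.05 under each correction
c(FDR = sum(mixed_FDR < 0.05), FWER = sum(mixed_FWER < 0.05))
Typically the FDR correction retains more of the true effects as significant, which is why it is preferred when testing many taxa.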