Statistics summary

This notebook is a brief introduction to various statistical analyses commonly used for amplicon data analysis - including some basic theory, links to more in-depth explanations, how to run them in R, and how to ensure the assumptions hold.

Getting more in-depth: Excellent introductions to various statistical concepts and analyses can be found in a Nature Methods collection of 2-page letters aimed at biologists.

The first paper, Importance of being uncertain, is a must read for anyone new to statistics.

In [1]:
# Load packages
library(ggplot2) # Plotting

Example datasets

mtcars: The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles

In [2]:
data(mtcars)

pairedData: A data.frame with randomly generated paired data

In [3]:
# Set seed (for reproducibility when doing things that involve randomness)
set.seed(42)
# Set number of samples
nSamples <- 10
# Build data.frame with random measurements
pairedData <- data.frame(sampletime = c(rep("before", nSamples), rep("after", nSamples)),
                         subject = rep(LETTERS[1:nSamples], 2),
                         measurement = rnorm(nSamples * 2))
# Shuffle data.frame
pairedData <- pairedData[sample(1:nrow(pairedData)), ]

Two sample T-test

The two-sample t-test is used for comparing the means of two samples; for example comparing a control condition with a treatment.

Assumptions

  1. Samples are independent (uncorrelated). This is the most crucial assumption. If the samples are paired, for example with before and after measurements, or are in other ways correlated one-to-one, use the paired t-test.

  2. Samples come from populations with an approximately normal distribution. The t-test is quite robust to deviations from this, but has a higher chance of false positives/negatives with highly non-normal distributions. Transformations (e.g. a log transform) can sometimes solve this problem.

  3. Variance is equal in the two samples. The t-test is, however, robust to deviations from this assumption. In R the default t-test is the Welch variant, which does not assume equal variances (a quick sketch of checking these assumptions is shown below).
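
A minimal sketch of informally checking assumptions 2 and 3 on the mtcars example used below (qqnorm/qqline and the var.equal argument are standard base R; exactly which checks to run is a judgment call):

# QQ-plots of displacement within each transmission group (normality check)
qqnorm(mtcars$disp[mtcars$am == 0]); qqline(mtcars$disp[mtcars$am == 0])
qqnorm(mtcars$disp[mtcars$am == 1]); qqline(mtcars$disp[mtcars$am == 1])

# The default t.test() is the Welch variant; the classic equal-variance t-test
# can be requested explicitly if the equal-variance assumption seems reasonable
t.test(disp ~ am, data = mtcars, var.equal = TRUE)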

In R

Let's compare the engine displacement of cars with automatic transmission (am = 0) to cars with manual transmission (am = 1)

First plot boxplots:

In [4]:
ggplot(mtcars, aes(factor(am), disp)) + geom_boxplot()

Then run the t-test:

In [5]:
t.test(disp ~ am, data = mtcars)
	Welch Two Sample t-test

data:  disp by am
t = 4.1977, df = 29.258, p-value = 0.00023
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
  75.32779 218.36857
sample estimates:
mean in group 0 mean in group 1 
       290.3789        143.5308 

We can see that the average displacement in cars with automatic transmission (group 0) is 290, and in cars with manual transmission (group 1) it is 144. This difference is significant, with a p-value of 0.00023. The difference between the two groups lies between 75 and 218 (95% confidence interval).
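
As a quick sanity check of the group means reported by t.test, they can be computed directly (a minimal sketch using base R):

# Mean displacement per transmission type (0 = automatic, 1 = manual)
aggregate(disp ~ am, data = mtcars, FUN = mean)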

Wilcoxon Rank Sum test

The Wilcoxon Rank Sum test is used to compare the medians of two samples, similar to a two-sample t-test. This test is, however, non-parametric and therefore makes (almost) no assumptions about the distributions of the populations. It is less powerful than the t-test, especially for small (n < 10) sample sizes.

Assumptions

  1. Similar to the t-test, the samples should be uncorrelated and independent. A paired version of this test is the Wilcoxon Signed Rank test.

  2. The two samples should come from populations with approximately similar shape.

In-depth paper on non-parametric tests

In R:

Let's compare the engine displacement of cars with automatic transmission (am = 0) to cars with manual transmission (am = 1)

In [6]:
wilcox.test(disp ~ am, data = mtcars)
Warning message in wilcox.test.default(x = c(258, 360, 225, 360, 146.7, 140.8, 167.6, :
"cannot compute exact p-value with ties"
	Wilcoxon rank sum test with continuity correction

data:  disp by am
W = 214, p-value = 0.0005493
alternative hypothesis: true location shift is not equal to 0

We can see that the displacement in cars with automatic transmission (group 0) differs from that in cars with manual transmission (group 1), with a p-value of 0.0005493. The warning tells us that some displacement values are tied, which means the p-value is an approximation. This is not problematic unless there are many ties.

Paired two sample tests

If samples are paired, a paired test should be used, for example with before and after measurements on the same subjects. See the two sections above for assumptions.

In R it is crucial that the observations are in the correct order, as the samples are paired according to the input order.
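
An alternative to relying on row order, sketched below, is to make the pairing explicit by merging the before and after rows on the subject column (plain base R merge; the rest of this notebook uses the ordering approach instead):

# Pair measurements explicitly by subject instead of relying on row order
wide <- merge(pairedData[pairedData$sampletime == "before", c("subject", "measurement")],
              pairedData[pairedData$sampletime == "after", c("subject", "measurement")],
              by = "subject", suffixes = c("_before", "_after"))
t.test(wide$measurement_after, wide$measurement_before, paired = TRUE)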

Ordering data in R

If we look at the dataset, we can see that it is not ordered according to the subject, as it should be:

In [7]:
pairedData
A data.frame: 20 × 3
   sampletime subject measurement
        <chr>   <chr>       <dbl>
11      after       A  1.30486965
15      after       E -0.13332134
8      before       H -0.09465904
4      before       D  0.63286260
17      after       G -0.28425292
6      before       F -0.10612452
2      before       B -0.56469817
13      after       C -1.38886070
12      after       B  2.28664539
5      before       E  0.40426832
16      after       F  0.63595040
14      after       D -0.27878877
18      after       H -2.65645542
9      before       I  2.01842371
3      before       C  0.36312841
1      before       A  1.37095845
20      after       J  1.32011335
7      before       G  1.51152200
19      after       I -2.44046693
10     before       J -0.06271410

Let's order it by subject:

In [8]:
pairedDataOrder <- pairedData[order(pairedData$subject), ]
pairedDataOrder
A data.frame: 20 × 3
   sampletime subject measurement
        <chr>   <chr>       <dbl>
11      after       A  1.30486965
1      before       A  1.37095845
2      before       B -0.56469817
12      after       B  2.28664539
13      after       C -1.38886070
3      before       C  0.36312841
4      before       D  0.63286260
14      after       D -0.27878877
15      after       E -0.13332134
5      before       E  0.40426832
6      before       F -0.10612452
16      after       F  0.63595040
17      after       G -0.28425292
7      before       G  1.51152200
8      before       H -0.09465904
18      after       H -2.65645542
9      before       I  2.01842371
19      after       I -2.44046693
20      after       J  1.32011335
10     before       J -0.06271410

Paired t-test in R

In [9]:
before <- pairedDataOrder[pairedDataOrder$sampletime == "before", "measurement"]
after <- pairedDataOrder[pairedDataOrder$sampletime == "after", "measurement"]

t.test(after, before, paired = TRUE)
	Paired t-test

data:  after and before
t = -1.0742, df = 9, p-value = 0.3107
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.2075921  0.7860851
sample estimates:
mean of the differences 
             -0.7107535 

The average difference between before and after measurements is -0.71 (-2.21 to 0.79, 95% CI) and this is not statistically significant, p-value = 0.3107.

Also notice that we only have 9 degrees of freedom, despite having 20 observations in the input data.frame. This is because we have repeated measurements within subjects, so we actually only have 10 independent measurements. A paired t-test is therefore equivalent to taking the difference between before and after measurements and then running a one-sample t-test.

Let's try the same analysis with a one-sample t-test:

In [10]:
# Find the difference
difference <- after - before
difference
  1. -0.0660887929231833
  2. 2.8513435640972
  3. -1.75198911244968
  4. -0.911651371778412
  5. -0.537589659534657
  6. 0.742074914161558
  7. -1.79577491885501
  8. -2.56179638249168
  9. -4.45889064245256
  10. 1.38282744478261
In [11]:
# One-sample t-test
t.test(difference, mu = 0)
	One Sample t-test

data:  difference
t = -1.0742, df = 9, p-value = 0.3107
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 -2.2075921  0.7860851
sample estimates:
 mean of x 
-0.7107535 

This result is identical to the paired two-sample test.
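
For completeness, the non-parametric counterpart mentioned above (the Wilcoxon Signed Rank test) takes the same paired = TRUE argument; a minimal sketch using the before/after vectors defined above:

# Paired, non-parametric alternative to the paired t-test
wilcox.test(after, before, paired = TRUE)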

Correlation

It is advised to read this excellent introduction on the difference between correlation, association, and causation.

As the above paper states: "two variables are correlated when they display an increasing or decreasing trend".

Correlation values range from 1 (perfect positive trend) through 0 (no trend) to -1 (perfect negative trend). Two correlation metrics are typically used: Pearson's r and Spearman's rho. Pearson's is used for linear trends; see, for example, a collection of scatter plots and their associated Pearson correlations.

Spearman's is used for monotonic trends. See the graph comparing the two on the Wikipedia page on Spearman's correlation.

Assumptions

Pearson's assumes linearity between the variables and that both variables are (somewhat) normally distributed and homoscedastic. See this section for details.

Spearman's assumes a monotonic trend between the variables. A non-monotonic trend is, for example, a U-shape or ∩-shape when the two variables are plotted in a scatter plot.
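
A minimal sketch illustrating the difference (x and y are made-up variables; exp gives a monotonic but non-linear trend):

x <- 1:20
y <- exp(x / 4)                 # monotonically increasing, but not linear
cor(x, y)                       # Pearson: below 1, since the trend is not linear
cor(x, y, method = "spearman")  # Spearman: exactly 1, since the trend is perfectly monotonic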

In R:

Pearson's correlation is calculated with the cor function (Pearson is the default method). Here we find the correlation between horsepower and engine displacement in the car data:

In [12]:
cor(mtcars$disp, mtcars$hp)
0.790948586369806

The order of the variables does not matter:

In [13]:
cor(mtcars$hp, mtcars$disp)
0.790948586369806

Let's try the same with Spearman's correlation:

In [14]:
cor(mtcars$disp, mtcars$hp, method="spearman")
0.851042626992907

Spearman's is basically Pearson's correlation on ranked data:

In [15]:
cor(rank(mtcars$disp), rank(mtcars$hp))
0.851042626992907

Let's make a plot to visualize the trend between the two variables:

In [16]:
plot(mtcars$disp, mtcars$hp)

We can get p-values by using the cor.test function with the same input:

In [17]:
cor.test(mtcars$disp, mtcars$hp)
	Pearson's product-moment correlation

data:  mtcars$disp and mtcars$hp
t = 7.0801, df = 30, p-value = 7.143e-08
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.6106794 0.8932775
sample estimates:
      cor 
0.7909486 
In [18]:
cor.test(mtcars$disp, mtcars$hp, method="spearman")
Warning message in cor.test.default(mtcars$disp, mtcars$hp, method = "spearman"):
"Cannot compute exact p-value with ties"
	Spearman's rank correlation rho

data:  mtcars$disp and mtcars$hp
S = 812.71, p-value = 6.791e-10
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho 
0.8510426 

Linear regression

In a simple linear regression we seek to find how changes in one continuous variable (X) affect the estimate of another continuous variable (Y). In contrast to correlation, linear regression is asymmetrical, such that swapping X and Y will give a different result (the p-value would, however, be the same). It is based on the assumption that X is fixed (measured perfectly) and that Y is random (measured with error). Furthermore, it assumes that the residuals (the distances between the actual data points and the statistical model) are normally distributed. As with Pearson's correlation, there is also an assumption of linearity and homoscedasticity.

In-depth paper on linear regression

In R:

Linear regression with engine displacement as the independent (explanatory) variable and horsepower as the dependent (response) variable.

In [19]:
# Save the model in the fit object
fit <- lm(hp ~ disp, data = mtcars)

# Get a summary of the model:
summary(fit)
Call:
lm(formula = hp ~ disp, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-48.623 -28.378  -6.558  13.588 157.562 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  45.7345    16.1289   2.836  0.00811 ** 
disp          0.4375     0.0618   7.080 7.14e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 42.65 on 30 degrees of freedom
Multiple R-squared:  0.6256,	Adjusted R-squared:  0.6131 
F-statistic: 50.13 on 1 and 30 DF,  p-value: 7.143e-08

The important output here are the coefficients and the Adjusted R-squared (variance explained). There are two coefficients, one for the Intercept and one for the predictor (disp). The Estimates are the estimated values of the coefficients. As such, the model ($Y = a \cdot X + b$) is here estimated as: $hp = 0.4375 \cdot disp + 45.7345$.

This result seems to fit with the plot above (see the Correlation section).

The p-values (Pr(>|t|)) test the hypothesis that the Estimate is zero. A low p-value therefore suggests that the Estimate is unlikely to be zero, given the assumptions of the linear regression. We can therefore see that the engine displacement is significantly associated with the horsepower. Also notice that the p-value is the same as the one from the Pearson correlation test. The adjusted R-squared tells us that our model explains 61.3% of the variance in the horsepower variable.
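
To see the asymmetry mentioned at the start of this section, one can swap the two variables (a minimal sketch): the intercept and slope change, because a different line is being fitted, while the p-value for the slope stays the same.

# Regression with the roles of the two variables swapped
summary(lm(disp ~ hp, data = mtcars))$coefficients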

Check assumptions in R:

The easiest way to check the assumptions of a linear regression is to plot the fitted model. This produces four plots, which should be used to ensure that the assumptions hold.

In [20]:
par(mfrow=c(2,2)) # Plot in a 2x2 grid
plot(fit)

Residuals vs Fitted

This plot should be used to check for non-linearities in the residuals. The red line should be approximately flat and horizontal around zero. This is to ensure that there is not some hidden non-linearity that our model does not capture.

Normal Q-Q

This plot should be used to check for normality of the residuals. The points should approximately follow the line. If you are unsure how it should look, you can simulate random normally distributed data to get a feel for it. Try running the code below several times with different sample sizes and check the output:

n_samples <- 20
residuals_simulated <- rnorm(n_samples)
qqnorm(residuals_simulated)
qqline(residuals_simulated)

Scale-Location

This plot should be used to check that the residuals are homoscedastic. The points should have an equal spread across the range of fitted values. If, for example, points were spread further apart at high fitted values than at low fitted values, it could help to re-run the model with a log-transformed response variable.

Residuals vs Leverage

This plot should be used to identify outliers. Cook's distance (dotted lines) measures the impact of a data point on the model, specifically how much the predicted estimate of y would change if the data point were omitted from the model. An outlier is a data point with a substantially higher Cook's distance than most of the other observations. We can see that the Maserati Bora data point may be an outlier, but Cook's distances below 1 are usually not a problem.
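
Cook's distances can also be extracted directly from the fitted model (a minimal sketch; the cutoff for flagging a point is a judgment call):

# Cook's distance per observation; list the three most influential cars
cooks_d <- cooks.distance(fit)
head(sort(cooks_d, decreasing = TRUE), 3)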

If the above is unclear, details can be found in this article

ANOVA

An ANOVA can be seen as an extension of the t-test. With the t-test you can only compare two groups of samples; ANOVA is for categorical variables with more than two groups.

Assumptions

Assumptions are as with the linear regression, and should be checked in a similar way.

In-depth paper on the ANOVA

In R:

Let's check if cars with different numbers of gears have different horsepower:

In [21]:
mtcars$gear <- as.factor(mtcars$gear)

# Define model
fit <- aov(hp ~ gear, data = mtcars)

# Get a summary of the model
summary(fit)
            Df Sum Sq Mean Sq F value  Pr(>F)    
gear         2  64213   32106   11.42 0.00022 ***
Residuals   29  81514    2811                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The number of gears seems to have an effect on the horsepower.

Note on as.factor()

We want to treat the 'gear' variable as categorical, which we tell R by converting it into a factor. This is crucial, since the variable consists of numbers (3, 4, or 5) and would therefore by default be treated as a continuous variable.
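
A minimal sketch of verifying the conversion:

# Confirm that 'gear' is now a factor and inspect its categories
is.factor(mtcars$gear)
levels(mtcars$gear)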

Multiple predictors in ANOVA

It is possible to add multiple predictors to an ANOVA (they can be added with +'s on the right-hand side of the formula). It is important to note that the default ANOVA in R uses Type I Sums of Squares (SS), so the order of the predictors will produce different results (unless the design is balanced). See this excellent blog for details on the differences between Type I/II/III SS in ANOVA.
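
A minimal sketch of this order dependence, using two predictors from mtcars (since the design is unbalanced, the sums of squares and p-values can differ between the two calls):

# Same two predictors, two different orders
summary(aov(hp ~ gear + disp, data = mtcars))
summary(aov(hp ~ disp + gear, data = mtcars))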

TukeyHSD post-hoc

We can run a TukeyHSD to get all pairwise comparisons

In [22]:
fitTukey <- TukeyHSD(fit)
fitTukey
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = hp ~ gear, data = mtcars)

$gear
         diff        lwr       upr     p adj
4-3 -86.63333 -137.34379 -35.92288 0.0006267
5-3  19.46667  -48.14727  87.08061 0.7589758
5-4 106.10000   36.40514 175.79486 0.0021447
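
The pairwise comparisons can also be visualized as confidence intervals with the built-in plot method (a small usage sketch):

# Plot the family-wise confidence intervals for the pairwise differences
plot(fitTukey)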

Linear model

A linear model is a general term that includes both linear regression and ANOVA. Linear regression is for continuous independent variables and ANOVA is for categorical independent variables; a linear model can be used for both, and for a combination of the two.

Assumptions

Assumptions are as with the linear regression, and should be checked in a similar way.

Furthermore, if there is more than one independent variable, they should not correlate strongly. Correlation between predictors is called collinearity and can strongly skew the results (see this paper for details)
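
A quick, informal check is to look at the pairwise correlations among the continuous predictors before fitting (a minimal sketch; a more formal check with VIF is shown further below):

# Pairwise correlations between the continuous predictors used later in this section
cor(mtcars[, c("disp", "drat")])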

In R:

Let's try to run the same model as for the ANOVA above.

In [23]:
# Define model
fit <- lm(hp ~ gear, data = mtcars)

# Get a summary of the model
summary(fit)
Call:
lm(formula = hp ~ gear, data = mtcars)

Residuals:
     Min       1Q   Median       3Q      Max 
-104.600  -26.133    3.683   30.025  139.400 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   176.13      13.69  12.867 1.63e-13 ***
gear4         -86.63      20.53  -4.219  0.00022 ***
gear5          19.47      27.38   0.711  0.48274    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 53.02 on 29 degrees of freedom
Multiple R-squared:  0.4406,	Adjusted R-squared:  0.4021 
F-statistic: 11.42 on 2 and 29 DF,  p-value: 0.0002196

The "gear" variable is split into so-called dummy variables. The first category (3 gears) is set as the Intercept, and we get estimates for the other categories (4 gears and 5 gears). The values next to "gear4" are the difference between the Intercept (3 gears) and the 4-gear category, and the values next to "gear5" are the difference between the Intercept (3 gears) and the 5-gear category.
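
The dummy coding R uses behind the scenes can be inspected directly (a minimal sketch using base R's model.matrix):

# Design matrix for the gear model: an intercept column plus dummy columns for gear4 and gear5
head(model.matrix(~ gear, data = mtcars))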

Pairwise comparisons

Similar to the ANOVA where we ran TukeyHSD, we can get comparisons between all pairs of categories for a linear model. For this we use the multcomp package:

In [24]:
library(multcomp)
summary(glht(fit, linfct=mcp(gear="Tukey")))
Loading required package: mvtnorm

Loading required package: survival

Loading required package: TH.data

Loading required package: MASS


Attaching package: 'TH.data'


The following object is masked from 'package:MASS':

    geyser


	 Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: Tukey Contrasts


Fit: lm(formula = hp ~ gear, data = mtcars)

Linear Hypotheses:
           Estimate Std. Error t value Pr(>|t|)    
4 - 3 == 0   -86.63      20.53  -4.219  < 0.001 ***
5 - 3 == 0    19.47      27.38   0.711  0.75635    
5 - 4 == 0   106.10      28.22   3.760  0.00207 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- single-step method)

The results are similar to the TukeyHSD results from the ANOVA section.

Multiple predictors

In a linear model we can include multiple predictors (as long as they do not correlate too strongly with each other). It is important to stress that the estimate and p-value of a predictor can change depending on which other predictors are included in the model. A model should therefore be interpreted as a whole.

Let's include "disp" and "drat" in the above model

In [25]:
# Define model
fit <- lm(hp ~ gear + disp + drat, data = mtcars)

# Get a summary of the model
summary(fit)
Call:
lm(formula = hp ~ gear + disp + drat, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-64.755 -18.886  -1.149  13.994  90.693 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2.29258   77.22139  -0.030  0.97653    
gear4       13.33969   25.69638   0.519  0.60790    
gear5       79.46678   24.59952   3.230  0.00324 ** 
disp         0.50930    0.07915   6.435 6.78e-07 ***
drat         3.90785   21.20956   0.184  0.85519    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 34.06 on 27 degrees of freedom
Multiple R-squared:  0.785,	Adjusted R-squared:  0.7532 
F-statistic: 24.65 on 4 and 27 DF,  p-value: 1.125e-08

We can check whether we have problems with collinearity with the vif function from the car package. A rule of thumb is that no VIF should be above 5. The first column of the output ("GVIF") should be used.

In [26]:
car::vif(fit)
A matrix: 3 × 3 of type dbl
         GVIF Df GVIF^(1/(2*Df))
gear 4.138276  2        1.426280
disp 2.571156  1        1.603482
drat 3.436313  1        1.853729

We can run an ANOVA on this linear model, which will combine the dummy variables of each categorical predictor into a single p-value:

In [27]:
anova(fit)
A anova: 4 × 5
             Df      Sum Sq     Mean Sq     F value       Pr(>F)
          <int>       <dbl>       <dbl>       <dbl>        <dbl>
gear          2 64212.94167 32106.47083 27.67414599 2.897836e-07
disp          1 50150.19545 50150.19545 43.22691950 4.723861e-07
drat          1    39.38508    39.38508  0.03394793 8.551945e-01
Residuals    27 31324.35281  1160.16122          NA           NA

Homogeneity of variance

It is an assumption of all linear models (including linear regression and ANOVA), and of Pearson's correlation, that there is homogeneity of variance (also called homoscedasticity, as opposed to heteroscedasticity). This means that the spread (variance) of the data points should not have any association with the mean.

Let's try to simulate some heteroscedastic data to visualize the problem

In [28]:
# Simulate heteroscedastic data
set.seed(1)
df <- data.frame(Group = c(rep("A", 20), rep("B", 20), rep("C", 20)),
                Value = c(rpois(20, 20), rpois(20, 100), rpois(20, 200)))

p <- ggplot(df, aes(Group, Value)) +
    theme_bw() +
    geom_point()
p

Let's try to run an ANOVA on the above data, and plot the diagnostics

In [29]:
fit <- aov(Value ~ Group, data = df)
plot(fit, which=1)

We can see that the spread increases as the fitted values increase. The p-values of this model will be highly biased. When the variance increases with the mean, it often helps to log-transform the response variable. Let's try that:

In [30]:
fit_log <- aov(log10(Value) ~ Group, data = df)
plot(fit_log, which=1)

It helps somewhat, although it partly creates the reverse problem: the variance is now higher at lower fitted values. Still, it is much better than with the raw data.

Alternatively, one could use permutation tests, robust methods, or another distribution than the normal distribution. The details of these alternatives are beyond the scope of this notebook.
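
As one concrete example of the last option (a minimal sketch, not covered further here): the counts simulated above are Poisson-distributed, so a generalized linear model with a Poisson family is a natural fit.

# GLM with a Poisson family instead of assuming normally distributed residuals
fit_pois <- glm(Value ~ Group, family = poisson, data = df)
summary(fit_pois)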