This notebook covers some basic supervised learning techniques that can be used in microbiome science.
The following methods can be used to find features (for example ASVs) that predict some outcome of interest, for example whether a sample comes from a control or a treatment group, or some other metadata associated with the samples. In some sense these methods have the same aim as Differential abundance analysis, but with supervised machine learning the purpose is not inference (based on p-values) but prediction.
Getting in-depth: If you want to learn more about machine learning I can highly recommend The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman.
Short primers:
Let's load our example dataset
library(phyloseq)
load("../data/physeq.RData")
To ensure our model is not overfitting (fitting our dataset well but failing to generalize to other similar datasets), we need to split our dataset into 3 parts. The train set is used to train a specific model, the validation set is used to compare models and choose the hyperparameters of the model, and the test set is used only to check how well our final model works.
A widely used method for choosing hyperparameters is cross-validation. With cross-validation the train and validation datasets are combined and split into k parts. The model is then fit k times, each time using a different part as the validation set and the remainder as the train set.
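As a minimal sketch of how such fold assignments could look (cv.glmnet, which we use further down, handles this internally, so this snippet is purely illustrative):
# Randomly assign each sample to one of k folds
k <- 5
n <- nsamples(phy)                         # phyloseq helper for the number of samples
folds <- sample(rep(1:k, length.out = n))  # shuffle a balanced fold assignment
table(folds)
# In round i, samples with folds == i form the validation set; the rest form the train set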
If you want to know how good your model really is, you should use a test set which has not been used for training at all. It is actually uncommon to see a test set in microbiome science, but because it is best practice we will use one in this notebook.
# Split into test and train/validate (30 random samples are used as a test set)
set.seed(42)
test_set <- sample(sample_names(phy), 30)
phy_train <- subset_samples(phy, !sample_names(phy) %in% test_set)
phy_test <- subset_samples(phy, sample_names(phy) %in% test_set)
phy_train
phy_test
A simple supervised learning method would be to use multiple linear regression and simply add all features as independent variables. However, the problem with microbiome datasets is that we usually have many more features than samples (the p > n problem), which means we cannot fit such models directly. A way around this is sparse regularization: we penalize the model for adding features, thereby forcing it to use only the features that are important enough for the prediction.
In-depth paper on regularization
L1 penalty, or LASSO, is a penalty which sets the estimates of "non-important" features to zero; that is, it selects which features are most important for predicting the outcome. If there are highly correlated features, it will choose randomly among them.
L2 penalty, or Ridge, is a penalty which shrinks the estimates of all features as more features are added. It therefore does not select features (most estimates will be non-zero), but it regularizes the model. It is better suited than LASSO when highly correlated features are included in the model.
Elastic net is a generalized penalty which introduces an alpha parameter. When alpha=1 it is a LASSO penalty, when alpha=0 it is Ridge, and with alpha between 0 and 1 it is a mix of the two.
All these sparse regularization methods need a lambda hyperparameter, which controls how strong the penalty is. Elastic net additionally needs the alpha hyperparameter.
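For reference, the elastic net penalty as parameterized in glmnet (which we use below) can be written as

$$\lambda \left( \alpha \lVert \beta \rVert_1 + \frac{1 - \alpha}{2} \lVert \beta \rVert_2^2 \right)$$

so alpha = 1 keeps only the L1 term (LASSO), alpha = 0 keeps only the L2 term (Ridge), and lambda scales the overall strength of the penalty.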
Let's fit a LASSO model. We use logistic regression as we want to predict whether our sample comes from a child born by C-section (1) or by vaginal birth (0).
library(glmnet)
# Extract outcome and make it binary
y <- ifelse(unlist(sample_data(phy_train)[,"Delivery"]) == "Sectio", 1, 0)
# Extract features and normalize and transform them
X <- otu_table(phy_train)
X <- apply(X, 2, function(x) (x + 1) / sum(x + 1))  # relative abundances with a pseudocount
X <- t(log10(X))
Note on transformation: As the model assumes linearity, we log-transform the relative abundances to make them more normally distributed. Alternatively, one could use a CLR transformation of the abundances.
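As a sketch of that alternative (not used further below), a CLR transform could look like this, assuming taxa are rows in the otu_table as in the code above:
# Centered log-ratio (CLR) transform per sample: log(x) minus the mean of log(x)
X_counts <- otu_table(phy_train)
X_clr <- apply(X_counts, 2, function(x) {
  rel <- (x + 1) / sum(x + 1)       # relative abundances with a pseudocount
  log(rel) - mean(log(rel))         # centre by the per-sample geometric mean (on the log scale)
})
X_clr <- t(X_clr)                   # samples as rows, as for X above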
Fit the model (with 5 cross-validation folds; 5-10 folds are usually recommended):
cvfit <- cv.glmnet(X, y, family = "binomial", alpha = 1, nfolds = 5)
We can plot the lambda parameter against the deviance. Low deviance means it's a good fit.
plot(cvfit)
If we start reading the plot from the left, we have many features in the model (84). As lambda increases (moving to the right), we get fewer features and the deviance decreases. At some point the deviance starts rising again as we get even fewer features. The first vertical line marks the best model (lowest deviance); the second vertical line marks the simplest model whose deviance is within 1 standard error of that of the best model.
With few features in the model (high lambda, right in the plot) we simply don't have enough information to make good predictions. With many features (low lambda, left in the plot) we start overfitting; features are added that only contribute variance to the model, so they might correlate with the outcome in the training set but not in the validation set, and are therefore probably noise. So the sweet spot is somewhere in between - the lowest point of the U. If the curve is not U-shaped, you might have specified the model incorrectly, or there is simply no signal in the data.
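We can also extract the two lambda values marked by the vertical lines, and count the non-zero coefficients at each (subtracting 1 for the intercept):
cvfit$lambda.min                              # lambda with the lowest cross-validated deviance
cvfit$lambda.1se                              # largest lambda within 1 SE of the minimum
sum(coef(cvfit, s = "lambda.min") != 0) - 1   # number of selected features at lambda.min
sum(coef(cvfit, s = "lambda.1se") != 0) - 1   # number of selected features at lambda.1se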
Let's see the coefficients of the simplest (but still good) model:
all_1se <- as.matrix(coef(cvfit, s = "lambda.1se"))
chosen_1se <- all_1se[all_1se != 0, ]  # keep only the non-zero coefficients
chosen_1se
Get taxonomy of the chosen ones (the -1 removes the intercept):
tax_table(phy)[names(chosen_1se)[-1]]
Above we have our chosen features and their associated estimates in the model. glmnet standardizes the features internally before fitting but reports the coefficients on the original scale; since all our features are on the same scale (log10 relative abundances), the estimates can still be compared, and the highest estimate (in absolute terms) can be said to be most important for the prediction. A positive estimate means that a higher abundance results in increased odds of being in the 1 group (C-section) compared to the 0 group (vaginal birth), and vice versa for negative estimates. This is a strength of linear models compared to, for example, decision trees (e.g. random forest), where the associations can be non-linear and therefore not necessarily easy to interpret.
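Because this is a logistic model, exponentiating a coefficient gives an odds ratio: the multiplicative change in the odds of C-section per one-unit increase of the feature (one unit on the log10 scale corresponds to a tenfold increase in relative abundance). A quick sketch:
exp(chosen_1se[-1])   # odds ratios for the selected features; [-1] drops the intercept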
Let's check how well the model performs on the test set:
# Extract outcome and make it binary
y_test <- ifelse(unlist(sample_data(phy_test)[,"Delivery"]) == "Sectio", 1, 0)
# Extract features and normalize and transform them
X_test <- otu_table(phy_test)
X_test <- apply(X_test, 2, function(x) (x + 1) / sum(x + 1))  # relative abundances with a pseudocount
X_test <- t(log10(X_test))
table(y_test, predict(object = cvfit, s = "lambda.1se", newx = X_test, type = "class"))
The rows are the truth (test set) and the columns are the predictions. Of the 14 samples that were 0 (vaginal birth), 11 were correctly predicted as such and 3 were falsely predicted as 1 (C-section). Of the 16 samples that were 1 (C-section), 12 were correctly predicted as such and 4 were falsely predicted as 0 (vaginal birth). The accuracy is 77% ((11+12)/30).
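Instead of reading it off the confusion table, the accuracy can also be computed directly (a small sketch; pred_test is just a temporary name):
pred_test <- predict(cvfit, s = "lambda.1se", newx = X_test, type = "class")
mean(as.character(pred_test) == y_test)   # proportion of correct predictions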
Let's test our model on our train/validation set:
table(y, predict(object = cvfit, s = "lambda.1se", newx = X, type = "class"))
Now the accuracy is 94%. So we can see that the accuracy becomes falsely inflated if we test the model on the same dataset that we used for training.
For more details on sparse regularized linear models, see here
Random forest is a model based on an ensemble of decision trees, combined using bagging (bootstrap aggregating) and random subsets of the features.
A decision tree is a type of model that splits the samples into groups (leaves on the tree) based on the features. For example, a simple tree could contain a single split such that samples in which an ASV is less abundant than 1% are in branch A and samples in which this ASV is more abundant than 1% are in branch B. The decision trees can have multiple splits, such that they are based on multiple features. In-depth paper on decision trees
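As a toy illustration of such a single split rule (a hypothetical ASV with made-up relative abundances, purely for illustration):
toy_abundance <- c(s1 = 0.002, s2 = 0.015, s3 = 0.030, s4 = 0.001)  # relative abundance of one ASV
ifelse(toy_abundance < 0.01, "branch A", "branch B")                # the split at 1%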
With bootstrapping one does random sampling with replacement of the samples. That is, instead of using the raw data as input to the model, the model randomly draws the same number of samples, but some samples can be included more than once (and some might be left out). Bootstrapping is crucial: without it the trees would be highly similar (correlated), which would bias the final model. In-depth paper on the bootstrap
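A minimal sketch of one bootstrap draw on our training samples (illustrative only; randomForest does this internally):
set.seed(1)
boot_ids <- sample(sample_names(phy_train), size = nsamples(phy_train), replace = TRUE)
table(table(boot_ids))                              # how many samples were drawn once, twice, ...
length(setdiff(sample_names(phy_train), boot_ids))  # samples left out of this draw ("out-of-bag")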
To create more variation among the decision trees, only a random subset of features are used at each split.
Random forest fits multiple decision trees; each tree is trained on a bootstrapped set of samples, and each split in a tree uses a random subset of the features. Each individual tree is a poor predictor, but aggregating many trees produces good predictions. For classification problems the trees are aggregated by majority vote, and for regression problems the aggregation is simply the mean prediction across all trees. In-depth paper on bagging and random forest
There are 2 main hyperparameters for random forests: the number of trees, and the number of randomly chosen features at each split (AKA mtry). For the latter there are widely used defaults: sqrt(n_features) for classification problems and n_features/3 for regression problems. As for the number of trees, a few hundred usually works well, but this number could also be tuned through cross-validation.
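For our training matrix these defaults would be approximately:
floor(sqrt(ncol(X)))   # default mtry for classification (square root of the number of features)
floor(ncol(X) / 3)     # common default for regression problems (not relevant here)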
Below we simply use the default parameters for random forest, and we therefore only have a training and test set.
library(randomForest)
fit <- randomForest(y = factor(y), x = X,
ytest = factor(y_test), xtest = X_test,
importance = TRUE)
fit
So the model has a test accuracy of 73%, a little worse than the LASSO linear model used above. OOB stands for Out-Of-Bag: each training sample is predicted using only the trees that did not include it in their bootstrap sample, and the error rate of these predictions gives the OOB estimate.
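Since we supplied xtest and ytest, the test-set predictions are stored in the fitted object, and the test accuracy can be recomputed as a quick check:
mean(as.character(fit$test$predicted) == y_test)   # proportion of correctly predicted test samples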
We can see which features are important for the model. MeanDecreaseAccuracy is the decrease in accuracy when the values of this feature are randomly permuted; a high value therefore means the feature is important:
fit$importance[rev(order(fit$importance[, "MeanDecreaseAccuracy"])), ]
We can merge this with the taxonomy:
tax <- data.frame(tax_table(phy))
imp <- fit$importance
imp_tax <- merge(imp, tax, by = "row.names")
imp_tax[rev(order(imp_tax$MeanDecreaseAccuracy)), ]