Supervised learning - a short primer

This notebook covers some basic supervised learning techniques, which can be used in microbiome science.

The following methods can be used to find features (for example ASVs) that can predict some outcome of interest, for example whether a sample comes from a control or treatment group, or some other metadata associated with the samples. In some sense, the methods have the same outcome as in Differential abundance analysis, but with supervised machine learning the purpose is not of inference (based on p-values), but it is prediction.

Getting in-depth: If you want to learn more about machine learning I can highly recommend The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman.

Short primers:

Dataset

Let's load our example dataset

In [1]:
library(phyloseq)
load("../data/physeq.RData")

Train, validate, and test

To ensure our model is not overfitting (having a good fit in our dataset, but cannot be generalized to other similar datasets), we need to split our dataset into 3 parts to ensure proper fitting. The train set of the dataset is used to train a specific model, the validation set is used compare models to choose hyperparameters of the model, and the test set is used only to check how good our final model works.

Cross-validation

A widely used method for choosing hyperparameters is to use cross-validation. With cross-validation the train and validation datasets are combined and are split into k parts. Then the model is fit k times using a different part of the dataset each time as validation set, and the remainder as train set.

To use a test set or not

If you want to know how good your model is, you should use a test set which has not been used for training at all. It's actually uncommon to see a test set in microbiome science, but because it is best practice we will use it in this notebook

In-depth paper on overfitting

In [2]:
# Split in test and train/validate (30 random samples are used a test set)
set.seed(42)
test_set <- sample(sample_names(phy), 30)
In [3]:
phy_train <- subset_samples(phy, !sample_names(phy) %in% test_set)
phy_test <- subset_samples(phy, sample_names(phy) %in% test_set)
In [4]:
phy_train
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 1310 taxa and 120 samples ]
sample_data() Sample Data:       [ 120 samples by 3 sample variables ]
tax_table()   Taxonomy Table:    [ 1310 taxa by 7 taxonomic ranks ]
phy_tree()    Phylogenetic Tree: [ 1310 tips and 1309 internal nodes ]
refseq()      DNAStringSet:      [ 1310 reference sequences ]
In [5]:
phy_test
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 1310 taxa and 30 samples ]
sample_data() Sample Data:       [ 30 samples by 3 sample variables ]
tax_table()   Taxonomy Table:    [ 1310 taxa by 7 taxonomic ranks ]
phy_tree()    Phylogenetic Tree: [ 1310 tips and 1309 internal nodes ]
refseq()      DNAStringSet:      [ 1310 reference sequences ]

Sparse regularized linear models

A simple supervised learning method would be to use multiple linear regression, and simply add all features as independent variables. However, the problem with microbiome datasets is that we usually have many more features than samples (p > n problem), which means we cannot fit these models. A way to fix this problem is to use sparse regularization; the idea is that we penalize the model when it adds features, meaning that we try to force the model to only use features that are important enough for the prediction.

In-depth paper on regularization

L1 - LASSO

L1 penalty or LASSO, is a penalty which sets the estimates of "non-important" features to zero, that is, it selectes which features are most important for predicting the outcome. If there are highly correlating features, it will choose randomly among these.

L2 - Ridge

L2 penalty or Ridge, is a penalty which reduces the estimates of all features as more features are added. It will therefore not select features (most estimates will be non-zero), but it will regularize the model. It is better fitted than LASSO if highly correlated features are included in the model

Elastic net

Elastic net is a generalized penalty which introduces an alpha parameter. When alpha=1 it is a LASSO penalty, when alpha=0 it is Ridge, and with alpha between 0 and 1 it is a mix of the two.

Hyperparameters

All these sparse regularization methods needs a lamda hyperparameter, which controls how strong the penalty is. Elastic net additionally needs the alpha hyperparameter.

In R:

Let's fit a LASSO model. We use logistic regression as we want to predict whether our sample comes from a child which has been born by C-section (1) or vaginal birth (0).

In-depth paper on logistic regression

In [6]:
library(glmnet)
Indlæser krævet pakke: Matrix

Loaded glmnet 4.1-2

In [7]:
# Extract outcome and make it binary
y <- ifelse(unlist(sample_data(phy_train)[,"Delivery"]) == "Sectio", 1, 0)

# Extract features and normalize and transform them
X <- otu_table(phy_train)
X <- apply(X, 2, function(x) x+1/sum(x+1))
X <- t(log10(X))

Note on transformation: As the model assume linearity we log-transform the relative abundances to make them more normal. Alternatively, one could do a CLR transformation of the abundances.

Fit the model (with 5 cross-validation folds, 5-10 are usually recommended)

In [8]:
cvfit <- cv.glmnet(X, y, family = "binomial", alpha = 1, nfolds = 5)

We can plot the lambda parameter against the deviance. Low deviance means it's a good fit.

In [9]:
plot(cvfit)

If we start reading the plot from the left, we have many features in the model (84), as lambda increases (moving to the right), we get fewer features and the deviance is getting smaller. At some point the deviance starts rising again as we get even fewer features. The first vertical line denotes the best model (lowest deviance), the second vertical line denotes the simplest model of which the deviance is within 1 standard error of the mean of the best model.

Why is the deviance curve U-shaped?

With few features in the model (high lambda, right in the plot) we simply don't have enough information to make good predictions. With many features (low lambda, left in the plot) we start overfitting; features are added that only contributes variance to the model, so they might correlate with the outcome in the training set, but not in the validation set, and are therefore probably noise. So the sweet spot is somewhere inbetween - the lowest point in the U. If the curve is not U-shaped you might have specified the model incorretly or there is simply no signal in the data.

Let's see the coefficients of the simplest (but still good) model:

In [10]:
all_1se <- as.matrix(coef(cvfit, s = "lambda.1se"))
chosen_1se <- all_1se[all_1se > 0, ]
chosen_1se
(Intercept)
1.49100719856145
2424186d5cc0b7c749f3645004df6a17
0.0653777787604116
e19d099ae7e7587eaa48af0f58917a0f
0.0585232901384771
1f685887ffe789e817d581f37aa3cfab
0.0465210763409769
43f3f2e3459e11bed9c3b5c187767fcf
0.0179657537606648
d4f4f46ec402280ae29751a0c0029f51
0.0710826617909787
7bc7407c94d7043a80982998cbac426d
0.0269666262793848
01afdaa62d29c7baa9d60216b0ea90c1
0.0249499682384017
aa0d9eb04bfc40cf6079830cf147dea2
0.049157129205169
b411bdf72ff49585b1b1d6058244f6ee
0.108437852270884
eae785b902cef8893ff591183dfa4efd
0.0630101250214731
9dece060ffcff93473a7023dcee0defd
0.0382871318249964
bb44afa4a4d1bf473bb3fa1c5427e8de
0.0185049655630583
50d6182d5d321721f2d223d9757415e2
0.000616705223342962
c1cb63a49b75a357d6fa0766f159baa3
0.0317236590063324
097c0e27d72b6830014897b90dd3f899
0.0151250926216182

Get taxonomy of the chosen ones (the -1 removes the intercept):

In [11]:
tax_table(phy)[names(chosen_1se)[-1]]
A taxonomyTable: 15 × 7 of type chr
KingdomPhylumClassOrderFamilyGenusSpecies
2424186d5cc0b7c749f3645004df6a17BacteriaProteobacteriaGammaproteobacteriaEnterobacterales EnterobacteriaceaeEscherichia Genus_Escherichia
e19d099ae7e7587eaa48af0f58917a0fBacteriaProteobacteriaGammaproteobacteriaEnterobacterales EnterobacteriaceaeEscherichia Genus_Escherichia
1f685887ffe789e817d581f37aa3cfabBacteriaProteobacteriaGammaproteobacteriaEnterobacterales EnterobacteriaceaeEscherichia Genus_Escherichia
43f3f2e3459e11bed9c3b5c187767fcfBacteriaProteobacteriaGammaproteobacteriaBetaproteobacterialesBurkholderiaceae Sutterella Sutterella_wadsworthensis(GB_GCA_000980335.1)
d4f4f46ec402280ae29751a0c0029f51BacteriaProteobacteriaGammaproteobacteriaEnterobacterales EnterobacteriaceaeFamily_EnterobacteriaceaeFamily_Enterobacteriaceae
7bc7407c94d7043a80982998cbac426dBacteriaProteobacteriaGammaproteobacteriaEnterobacterales EnterobacteriaceaeFamily_EnterobacteriaceaeFamily_Enterobacteriaceae
01afdaa62d29c7baa9d60216b0ea90c1BacteriaProteobacteriaGammaproteobacteriaEnterobacterales EnterobacteriaceaeFamily_EnterobacteriaceaeFamily_Enterobacteriaceae
aa0d9eb04bfc40cf6079830cf147dea2BacteriaProteobacteriaGammaproteobacteriaPseudomonadales Moraxellaceae Moraxella Moraxella_catarrhalis(RS_GCF_000092265.1)
b411bdf72ff49585b1b1d6058244f6eeBacteriaBacteroidota Bacteroidia Bacteroidales Barnesiellaceae Barnesiella Barnesiella_intestinihominis(GB_GCA_000980475.1)
eae785b902cef8893ff591183dfa4efdBacteriaBacteroidota Bacteroidia Bacteroidales Bacteroidaceae Prevotella Prevotella_buccae(RS_GCF_000162455.1)
9dece060ffcff93473a7023dcee0defdBacteriaFirmicutes_A Clostridia Clostridiales Clostridiaceae Clostridium_P Clostridium_P_perfringens(RS_GCF_000009685.1)
bb44afa4a4d1bf473bb3fa1c5427e8deBacteriaFirmicutes_C Negativicutes Veillonellales Veillonellaceae Veillonella Genus_Veillonella
50d6182d5d321721f2d223d9757415e2BacteriaFirmicutes_C Negativicutes Veillonellales Veillonellaceae Veillonella Veillonella_dispar(RS_GCF_000160015.1)
c1cb63a49b75a357d6fa0766f159baa3BacteriaBacteroidota Bacteroidia Bacteroidales Bacteroidaceae Bacteroides Bacteroides_eggerthii(RS_GCF_000273465.1)
097c0e27d72b6830014897b90dd3f899BacteriaFirmicutes_A Clostridia Lachnospirales Lachnospiraceae Ruminococcus_B (RS_GCF_000509105.1)

Above we have our chosen features and associated estimates for the model. As glmnet by default is standardizing the features, the estimates can be compared directly, and the highest estimate (in absolute terms) can be said to be most important for the prediction. Positive estimates would mean that higher abundance results in increase odds of being in the 1 group (C-section) compared to the 0 group (vaginal birth), and vice versa for negative estimates. This a strength of linear models compared to for example decision trees (e.g. random forest) where the associations can be non-linear and therefore not necesarrily easily interpretable.

Let's check how good the model is on the test set

In [12]:
# Extract outcome and make it binary
y_test <- ifelse(unlist(sample_data(phy_test)[,"Delivery"]) == "Sectio", 1, 0)

# Extract features and normalize and transform them
X_test <- otu_table(phy_test)
X_test <- apply(X_test, 2, function(x) x+1/sum(x+1))
X_test <- t(log10(X_test))
In [13]:
table(y_test, predict(object = cvfit, s = "lambda.1se", newx = X_test, type = "class"))
      
y_test  0  1
     0 11  3
     1  4 12

The rows are the truths (test set) and the columns are the predicted. So of the 14 samples that were 0 (vaginal birth), 11 were correctly predicted as such and 3 were false predicted as 1 (C-section). Of the 16 samples that were 1 (C-section), 12 were correctly predicted as such and 4 were false predicted as 0 (vaginal birth). The accuracy is 77% ((11+12)/30).

Let's test our model on our train/validation set:

In [14]:
table(y, predict(object = cvfit, s = "lambda.1se", newx = X, type = "class"))
   
y    0  1
  0 57  4
  1  3 56

Now the accuracy is 94%. So we can see that the accuracy becomes falsely inflated if we were to test it on the same dataset as we used to training.

For more details on sparse regularized linear models, see here

Random forests

Random forests is a model based on an ensemble of decision trees based on bagging (bootstrap aggregating) and random subsets of the features.

Decision tree

A decision tree is a type of model that splits the samples into groups (leaves on the tree) based on the features. For example, a simple tree could contain a single split such that samples in which an ASV is less abundant than 1% are in branch A and samples in which this ASV is more abundant than 1% are in branch B. The decision trees can have multiple splits, such that they are based on multiple features. In-depth paper on decision trees

Bootstrapping

With bootstrapping one does a random sampling with replacement of the samples. That is, instead of using the raw data as input to the model, the model randomly chooses the same number of samples, but some samples can be included more than one time (and some might be excluded). Bootstrapping is crucial, as without bootstrapping the trees would be highly similar (correlated), which would bias the final model. In-depth paper in the bootstrap

Random feature subset

To create more variation among the decision trees, only a random subset of features are used at each split.

In essence

Random forest is fitting multiple decision trees, each tree is trained on bootstrapped samples, and each split in the trees are using random subsets of the features. Each tree is actually a bad predictor, but aggregating a lot of trees will produce good predictions. For classification problems they are aggregated by majority vote, and for regression problems the aggregation is simply the mean predictions across all trees. In-depth paper on bagging and random forest

Hyperparameters

There are 2 main hyperparameters for random forests, the number of trees, and the number of randomly chosen features at each split (AKA mtry). The last one has some widely used recommendations; sqrt(n_features) for classification problems and n_features/3 for regression problems. As for the number of trees, a few hundred usually works well, but this number could also be tuned through cross-validation.

In R:

Below we simply use the default parameters for random forest, and we therefore only have a training and test set.

In [15]:
library(randomForest)
randomForest 4.6-14

Type rfNews() to see new features/changes/bug fixes.

In [16]:
fit <- randomForest(y = factor(y), x = X, 
                    ytest = factor(y_test), xtest = X_test, 
                    importance = TRUE)
In [17]:
fit
Call:
 randomForest(x = X, y = factor(y), xtest = X_test, ytest = factor(y_test),      importance = TRUE) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 36

        OOB estimate of  error rate: 33.33%
Confusion matrix:
   0  1 class.error
0 47 14   0.2295082
1 26 33   0.4406780
                Test set error rate: 26.67%
Confusion matrix:
   0  1 class.error
0 10  4   0.2857143
1  4 12   0.2500000

So the model has a test accuracy of 73%, a little worse than the LASSO linear model used above. The OOB estimate of error rate means Out-Of-Bag, and is using the samples not included by the bootstrap procedure as a validation set.

We can see which features are important for the model. MeanDecreaseAccuracy is the decrease in accuracy if this features was removed from the model, thus a high value means the feature is important:

In [18]:
fit$importance[rev(order(fit$importance[, "MeanDecreaseAccuracy"])), ]
A matrix: 1310 × 4 of type dbl
01MeanDecreaseAccuracyMeanDecreaseGini
50d6182d5d321721f2d223d9757415e20.00511216460.00900250250.0069679820.9682275
7bc7407c94d7043a80982998cbac426d0.00442557060.00396405360.0041353490.5744608
c1cb63a49b75a357d6fa0766f159baa30.00205959140.00545737560.0036987600.5236623
45c4f6f269ec95ff0c8dd594104519e90.00448792500.00238654510.0036211510.6068007
d20d3658de331c939f54d8acaed7c4c10.00052962900.00570099040.0031291860.5987402
ca5152dd2313e7fe7d25872f10c57f6d0.00195530350.00418343280.0030896870.5076726
eae785b902cef8893ff591183dfa4efd0.00154539080.00391370510.0026986060.3370282
d44ac2ef51574f366f2223bb95cde1110.00167633310.00373009940.0026445480.5763823
ccc3cce7144df93b827dc9a9fd18dfaf0.00217396130.00316851220.0025621800.4822958
aa0d9eb04bfc40cf6079830cf147dea20.00197776430.00251668830.0021685770.2161084
72bda5561e865a5539c4f19f992ee8300.00163571190.00228708860.0021211160.4508188
8c3ff6c4c4b125d5e72f5276d57be4d00.00182170750.00256719440.0020843910.4306079
deb0bdec6009f88d3cb68e762aa28ae20.00043300100.00360570440.0019933730.5082881
b411bdf72ff49585b1b1d6058244f6ee0.00207393550.00157372540.0018845470.3565262
57face48ec571894748b66cab7c52d5e0.00070472580.00325647350.0018501470.3893174
9dece060ffcff93473a7023dcee0defd0.00147968310.00227347090.0018344560.1864582
e89deffbf236ceb5b5715a75ff5653620.00287562560.00100228620.0017326480.3422206
8b6432496d22947e776670c0e6d2cc700.00147488620.00190722940.0016393630.2684984
1099aa16dc16f656847f5ff1158f67f20.00141570430.00181292090.0015728460.3879930
2424186d5cc0b7c749f3645004df6a170.00182146760.00151649600.0015657610.2682592
a81c0b48b67c7b7ffbb424e74a7abc180.00075583510.00217199340.0015502180.1941805
7078a2866c4a4b71fcc6ed7779369c8d0.00058929040.00258327420.0015029020.3049354
d4f4f46ec402280ae29751a0c0029f510.00234543040.00071322510.0014514340.3091102
8639874b1f9d7eaf752febcf6f7ca9dd0.00171883870.00077154650.0012915270.2049717
20cfe7f61d18f6525cc71caae0ab28dc0.00075014070.00181859770.0012801200.4244153
5cea774e6a6fc2b459c324708ac5a9380.00140336230.00114666130.0012093620.2079192
11d9c9ed1d2d58a9cb0bbcdc93aa3e550.00130415330.00113006310.0011869180.2431834
4434d904522c3f016aa715d906835ac90.00216032110.00012443570.0011698970.2064970
f38996f4570cc50b2c5c017248157b5f0.00117260840.00113655500.0011485750.1746500
32f1a5ebac1fd78f3e3ba08e7e69ae220.00151867470.00077447670.0011178030.1984658
fccf9b38f0596644eab0612a5cd57597-1.967570e-04-4.818865e-04-0.00032904960.051999483
800a98e4afebc3671d3ecb61c20a9ebf-2.381818e-04-4.225240e-04-0.00033031300.058927076
6ae50439574d86851b2fcc8da95799d7-1.800000e-04-4.812834e-04-0.00033333330.039350609
8a708b7d22364ef01b63768cb3d65a47-6.480214e-04-6.198758e-05-0.00033685540.057117816
d60f98c9d37d9826084e1cced55a9603-5.121961e-04-1.199550e-04-0.00033960520.107884726
d34ae33d6d3b7263fbb06ee56f42cad5 2.885965e-04-9.209024e-04-0.00034135790.044531950
abe4d8ae6feac48213ae7ad7d91a980d-7.493886e-05-5.707904e-04-0.00034166050.125406294
69c79e8f047a62b91945789163af472a-4.838462e-04-1.492620e-04-0.00034939950.023154035
5d385e9cae0b53aac88268333085dbad-1.379310e-04-5.454545e-04-0.00035620160.017654422
3f6d2cfe63af05924d80cc830508d812-4.954611e-04-1.942857e-04-0.00036528330.052178368
b60005ac44a7a8201fd8fb0eb568e7e9-3.275899e-04-3.822693e-04-0.00037345790.065446856
69a55d967a3ed1f68b1833eb62d82fbd 4.435868e-05-8.275826e-04-0.00037552300.137407853
b6b383eaf9e218d178a7ea40745ebabb-5.544890e-04-1.805013e-04-0.00037587700.048912200
d0f645df3d7e1695f903478dabd44e28-1.333333e-04-5.869565e-04-0.00038050190.053853401
8684679687bdd5cc1e37e1b19d139640-5.254516e-05-8.603997e-04-0.00038344740.046909380
eff22d490541422627e546be4f44601f-7.255686e-04-3.886024e-05-0.00038589040.197407288
f42bcefbf37a01aef7b9483ca5bf6961-4.761905e-04-3.000000e-04-0.00039024390.008533333
848c5f08a0a08680e0f2783a9088fc53-3.250044e-04-3.856322e-04-0.00039977880.057749378
4fad93adaab891f153808884f1da1fec-4.898785e-04-2.991453e-04-0.00040746420.058284787
a4509b13baa96bceb7547508b1aa4242-7.482517e-04-1.435407e-05-0.00042513710.070461768
cceb8bd4c48dbbc2e0a83eb892990a22-3.758671e-05-8.586467e-04-0.00047757790.149899674
e8876853853f1863fd77efbd7d874967-6.835197e-04-3.198920e-04-0.00047893200.049155868
2ea17744c7eeab459b7f41d4f9e22894-5.444444e-04-4.594911e-04-0.00050721470.022866771
44d9fc4de8898b6c82c69654435a9f0b-9.093117e-04 2.803194e-04-0.00053470250.179871621
4e8b51098d3b598eadf069dcc96e81aa-6.222727e-04-3.953461e-04-0.00054019320.345341356
625ebcdc01a6ae464668e8c2b9817a12-2.653941e-04-7.847601e-04-0.00065885460.050838016
a600fd96dfa563a51e8e70a101cb6fb2-6.511640e-04-6.893139e-04-0.00069505080.089450244
4d5512974bb5e74d555bdc9cf33c8d63-8.348895e-04-9.004046e-04-0.00087052960.058151245
107063e8453bf94891c4d7a17b00af07-7.478750e-04-8.853785e-04-0.00088012040.131385216
08b21f8171b35ef001832ba1df9b8fbc-1.779978e-03-7.059238e-04-0.00121509670.188807717

We can merge this with the taxonomy:

In [19]:
tax <- data.frame(tax_table(phy))
imp <- fit$importance
imp_tax <- merge(imp, tax, by = "row.names")
imp_tax[rev(order(imp_tax$MeanDecreaseAccuracy)), ]
A data.frame: 1310 × 12
Row.names01MeanDecreaseAccuracyMeanDecreaseGiniKingdomPhylumClassOrderFamilyGenusSpecies
<I<chr>><dbl><dbl><dbl><dbl><chr><chr><chr><chr><chr><chr><chr>
41550d6182d5d321721f2d223d9757415e20.00511216460.00900250250.0069679820.9682275BacteriaFirmicutes_C Negativicutes Veillonellales Veillonellaceae Veillonella Veillonella_dispar(RS_GCF_000160015.1)
6287bc7407c94d7043a80982998cbac426d0.00442557060.00396405360.0041353490.5744608BacteriaProteobacteriaGammaproteobacteriaEnterobacterales EnterobacteriaceaeFamily_EnterobacteriaceaeFamily_Enterobacteriaceae
973c1cb63a49b75a357d6fa0766f159baa30.00205959140.00545737560.0036987600.5236623BacteriaBacteroidota Bacteroidia Bacteroidales Bacteroidaceae Bacteroides Bacteroides_eggerthii(RS_GCF_000273465.1)
36245c4f6f269ec95ff0c8dd594104519e90.00448792500.00238654510.0036211510.6068007BacteriaProteobacteriaGammaproteobacteriaEnterobacterales EnterobacteriaceaeCitrobacter Genus_Citrobacter
1055d20d3658de331c939f54d8acaed7c4c10.00052962900.00570099040.0031291860.5987402BacteriaBacteroidota Bacteroidia Bacteroidales Bacteroidaceae Bacteroides Genus_Bacteroides
1010ca5152dd2313e7fe7d25872f10c57f6d0.00195530350.00418343280.0030896870.5076726BacteriaFirmicutes Bacilli Lactobacillales Lactobacillaceae Lactobacillus_B Lactobacillus_B_ruminis(RS_GCF_001436475.1)
1182eae785b902cef8893ff591183dfa4efd0.00154539080.00391370510.0026986060.3370282BacteriaBacteroidota Bacteroidia Bacteroidales Bacteroidaceae Prevotella Prevotella_buccae(RS_GCF_000162455.1)
1068d44ac2ef51574f366f2223bb95cde1110.00167633310.00373009940.0026445480.5763823BacteriaBacteroidota Bacteroidia Bacteroidales Bacteroidaceae Bacteroides Bacteroides_rodentium(GB_GCA_000614125.1)
1018ccc3cce7144df93b827dc9a9fd18dfaf0.00217396130.00316851220.0025621800.4822958BacteriaFirmicutes Bacilli Lactobacillales Lactobacillaceae Lactobacillus_C Lactobacillus_C_casei(RS_GCF_000829055.1)
1301aa0d9eb04bfc40cf6079830cf147dea20.00197776430.00251668830.0021685770.2161084BacteriaProteobacteriaGammaproteobacteriaPseudomonadales Moraxellaceae Moraxella Moraxella_catarrhalis(RS_GCF_000092265.1)
57972bda5561e865a5539c4f19f992ee8300.00163571190.00228708860.0021211160.4508188BacteriaFirmicutes_A Clostridia Lachnospirales Lachnospiraceae Ruminococcus_B Ruminococcus_B_gnavus(RS_GCF_000526735.1)
7208c3ff6c4c4b125d5e72f5276d57be4d00.00182170750.00256719440.0020843910.4306079BacteriaFirmicutes_A Clostridia Oscillospirales Ruminococcaceae Bittarella (GB_GCA_900066655.1)
1114deb0bdec6009f88d3cb68e762aa28ae20.00043300100.00360570440.0019933730.5082881BacteriaProteobacteriaGammaproteobacteriaEnterobacterales EnterobacteriaceaeEscherichia Escherichia_sp2(RS_GCF_000407765.1)
904b411bdf72ff49585b1b1d6058244f6ee0.00207393550.00157372540.0018845470.3565262BacteriaBacteroidota Bacteroidia Bacteroidales Barnesiellaceae Barnesiella Barnesiella_intestinihominis(GB_GCA_000980475.1)
45057face48ec571894748b66cab7c52d5e0.00070472580.00325647350.0018501470.3893174BacteriaBacteroidota Bacteroidia Bacteroidales Muribaculaceae Family_Muribaculaceae Family_Muribaculaceae
8109dece060ffcff93473a7023dcee0defd0.00147968310.00227347090.0018344560.1864582BacteriaFirmicutes_A Clostridia Clostridiales Clostridiaceae Clostridium_P Clostridium_P_perfringens(RS_GCF_000009685.1)
1170e89deffbf236ceb5b5715a75ff5653620.00287562560.00100228620.0017326480.3422206BacteriaProteobacteriaGammaproteobacteriaEnterobacterales EnterobacteriaceaeEscherichia Genus_Escherichia
7138b6432496d22947e776670c0e6d2cc700.00147488620.00190722940.0016393630.2684984BacteriaBacteroidota Bacteroidia Bacteroidales Barnesiellaceae Barnesiella Barnesiella_intestinihominis(GB_GCA_000980475.1)
891099aa16dc16f656847f5ff1158f67f20.00141570430.00181292090.0015728460.3879930BacteriaProteobacteriaGammaproteobacteriaEnterobacterales EnterobacteriaceaeFamily_EnterobacteriaceaeFamily_Enterobacteriaceae
1782424186d5cc0b7c749f3645004df6a170.00182146760.00151649600.0015657610.2682592BacteriaProteobacteriaGammaproteobacteriaEnterobacterales EnterobacteriaceaeEscherichia Genus_Escherichia
863a81c0b48b67c7b7ffbb424e74a7abc180.00075583510.00217199340.0015502180.1941805BacteriaBacteroidota Bacteroidia Bacteroidales Bacteroidaceae Prevotella Prevotella_copri_A(RS_GCF_002224675.1)
5687078a2866c4a4b71fcc6ed7779369c8d0.00058929040.00258327420.0015029020.3049354BacteriaBacteroidota Bacteroidia Bacteroidales Bacteroidaceae Bacteroides_B Bacteroides_B_dorei(RS_GCF_001640865.1)
1070d4f4f46ec402280ae29751a0c0029f510.00234543040.00071322510.0014514340.3091102BacteriaProteobacteriaGammaproteobacteriaEnterobacterales EnterobacteriaceaeFamily_EnterobacteriaceaeFamily_Enterobacteriaceae
6848639874b1f9d7eaf752febcf6f7ca9dd0.00171883870.00077154650.0012915270.2049717BacteriaFirmicutes_A Clostridia Lachnospirales Lachnospiraceae Fusicatenibacter Fusicatenibacter_saccharivorans(RS_GCF_001405555.1)
16620cfe7f61d18f6525cc71caae0ab28dc0.00075014070.00181859770.0012801200.4244153BacteriaProteobacteriaGammaproteobacteriaEnterobacterales EnterobacteriaceaeEscherichia Genus_Escherichia
4785cea774e6a6fc2b459c324708ac5a9380.00140336230.00114666130.0012093620.2079192BacteriaFirmicutes_A Clostridia Oscillospirales Ruminococcaceae Bittarella (GB_GCA_900066655.1)
9911d9c9ed1d2d58a9cb0bbcdc93aa3e550.00130415330.00113006310.0011869180.2431834BacteriaFirmicutes Bacilli Lactobacillales Lactobacillaceae Lactobacillus_H Lactobacillus_H_fermentum(RS_GCF_001742205.1)
3574434d904522c3f016aa715d906835ac90.00216032110.00012443570.0011698970.2064970BacteriaProteobacteriaGammaproteobacteriaEnterobacterales EnterobacteriaceaePantoea Genus_Pantoea
1235f38996f4570cc50b2c5c017248157b5f0.00117260840.00113655500.0011485750.1746500BacteriaFirmicutes_A Clostridia Clostridiales Clostridiaceae Clostridium_P Clostridium_P_perfringens(RS_GCF_000009685.1)
24832f1a5ebac1fd78f3e3ba08e7e69ae220.00151867470.00077447670.0011178030.1984658BacteriaFirmicutes_C Negativicutes AcidaminococcalesAcidaminococcaceaeAcidaminococcus Genus_Acidaminococcus
1282fccf9b38f0596644eab0612a5cd57597-1.967570e-04-4.818865e-04-0.00032904960.051999483BacteriaFirmicutes_A Clostridia Lachnospirales Lachnospiraceae CAG-65 CAG-65_sp3(GB_GCA_900066565.1)
648800a98e4afebc3671d3ecb61c20a9ebf-2.381818e-04-4.225240e-04-0.00033031300.058927076BacteriaFirmicutes_C Negativicutes Veillonellales Veillonellaceae Veillonella_A Veillonella_A_seminalis(RS_GCF_000315505.1)
5416ae50439574d86851b2fcc8da95799d7-1.800000e-04-4.812834e-04-0.00033333330.039350609BacteriaFirmicutes_A Clostridia Lachnospirales Lachnospiraceae Family_Lachnospiraceae Family_Lachnospiraceae
7088a708b7d22364ef01b63768cb3d65a47-6.480214e-04-6.198758e-05-0.00033685540.057117816BacteriaFirmicutes Bacilli Staphylococcales Staphylococcaceae Staphylococcus Genus_Staphylococcus
1075d60f98c9d37d9826084e1cced55a9603-5.121961e-04-1.199550e-04-0.00033960520.107884726BacteriaActinobacteriotaActinobacteria Actinomycetales Micrococcaceae Rothia Rothia_mucilaginosa_A(RS_GCF_001809565.1)
1060d34ae33d6d3b7263fbb06ee56f42cad5 2.885965e-04-9.209024e-04-0.00034135790.044531950BacteriaBacteroidota Bacteroidia Bacteroidales Dysgonomonadaceae Dysgonomonas Dysgonomonas_macrotermitis(RS_GCF_001047035.1)
870abe4d8ae6feac48213ae7ad7d91a980d-7.493886e-05-5.707904e-04-0.00034166050.125406294BacteriaProteobacteria GammaproteobacteriaPseudomonadales Moraxellaceae Acinetobacter (RS_GCF_001647675.1)
53669c79e8f047a62b91945789163af472a-4.838462e-04-1.492620e-04-0.00034939950.023154035BacteriaProteobacteria AlphaproteobacteriaRF32 CAG-239 CAG-495 (GB_GCA_001917125.1)
4805d385e9cae0b53aac88268333085dbad-1.379310e-04-5.454545e-04-0.00035620160.017654422BacteriaFirmicutes Bacilli Bacillales Bacillaceae_G Bacillus_A Genus_Bacillus_A
3243f6d2cfe63af05924d80cc830508d812-4.954611e-04-1.942857e-04-0.00036528330.052178368BacteriaBacteroidota Bacteroidia Bacteroidales Marinifilaceae Butyricimonas (RS_GCF_002161485.1)
910b60005ac44a7a8201fd8fb0eb568e7e9-3.275899e-04-3.822693e-04-0.00037345790.065446856BacteriaFirmicutes_A Clostridia Lachnospirales Lachnospiraceae Ruminococcus_B Ruminococcus_B_fissicatena(RS_GCF_000190355.1)
53469a55d967a3ed1f68b1833eb62d82fbd 4.435868e-05-8.275826e-04-0.00037552300.137407853BacteriaActinobacteriotaActinobacteria Actinomycetales Bifidobacteriaceae Bifidobacterium Genus_Bifidobacterium
917b6b383eaf9e218d178a7ea40745ebabb-5.544890e-04-1.805013e-04-0.00037587700.048912200BacteriaFirmicutes_A Clostridia PeptostreptococcalesPeptostreptococcaceaeIntestinibacter Intestinibacter_bartlettii(RS_GCF_000154445.1)
1049d0f645df3d7e1695f903478dabd44e28-1.333333e-04-5.869565e-04-0.00038050190.053853401BacteriaBacteroidota Bacteroidia Bacteroidales Rikenellaceae Alistipes Genus_Alistipes
6898684679687bdd5cc1e37e1b19d139640-5.254516e-05-8.603997e-04-0.00038344740.046909380BacteriaFirmicutes_A Clostridia PeptostreptococcalesPeptostreptococcaceaeClostridioides Clostridioides_difficile_A(RS_GCF_001299635.1)
1214eff22d490541422627e546be4f44601f-7.255686e-04-3.886024e-05-0.00038589040.197407288BacteriaProteobacteria GammaproteobacteriaEnterobacterales Enterobacteriaceae Escherichia Genus_Escherichia
1239f42bcefbf37a01aef7b9483ca5bf6961-4.761905e-04-3.000000e-04-0.00039024390.008533333BacteriaFirmicutes_A Clostridia Lachnospirales Lachnospiraceae Coprococcus Coprococcus_eutactus(RS_GCF_000154425.1)
672848c5f08a0a08680e0f2783a9088fc53-3.250044e-04-3.856322e-04-0.00039977880.057749378BacteriaFirmicutes_A Clostridia Oscillospirales DTU089 Ruminococcus_E Ruminococcus_E_bromii(GB_GCA_900067015.1)
4094fad93adaab891f153808884f1da1fec-4.898785e-04-2.991453e-04-0.00040746420.058284787BacteriaFirmicutes_A Clostridia CAG-41 UBA1381 CAG-41 CAG-41_sp1(GB_GCA_900066215.1)
832a4509b13baa96bceb7547508b1aa4242-7.482517e-04-1.435407e-05-0.00042513710.070461768BacteriaActinobacteriotaCoriobacteriia Coriobacteriales Eggerthellaceae Senegalimassilia Senegalimassilia_anaerobia(RS_GCF_000236865.1)
1020cceb8bd4c48dbbc2e0a83eb892990a22-3.758671e-05-8.586467e-04-0.00047757790.149899674BacteriaBacteroidota Bacteroidia Bacteroidales Bacteroidaceae Bacteroides Bacteroides_clarus(RS_GCF_000195615.1)
1168e8876853853f1863fd77efbd7d874967-6.835197e-04-3.198920e-04-0.00047893200.049155868BacteriaFirmicutes_A Clostridia Oscillospirales Ruminococcaceae Anaerotruncus Anaerotruncus_colihominis(RS_GCF_001404495.1)
2252ea17744c7eeab459b7f41d4f9e22894-5.444444e-04-4.594911e-04-0.00050721470.022866771BacteriaFusobacteriota Fusobacteriia Fusobacteriales Fusobacteriaceae Fusobacterium Fusobacterium_periodonticum_B(RS_GCF_000163935.1)
35944d9fc4de8898b6c82c69654435a9f0b-9.093117e-04 2.803194e-04-0.00053470250.179871621BacteriaFirmicutes Bacilli Lactobacillales Lactobacillaceae Lactobacillus Lactobacillus_johnsonii(RS_GCF_000091405.1)
4044e8b51098d3b598eadf069dcc96e81aa-6.222727e-04-3.953461e-04-0.00054019320.345341356BacteriaProteobacteria GammaproteobacteriaEnterobacterales Enterobacteriaceae Family_EnterobacteriaceaeFamily_Enterobacteriaceae
504625ebcdc01a6ae464668e8c2b9817a12-2.653941e-04-7.847601e-04-0.00065885460.050838016BacteriaFirmicutes_C Negativicutes Veillonellales Dialisteraceae Dialister Dialister_micraerophilus(RS_GCF_000183445.1)
845a600fd96dfa563a51e8e70a101cb6fb2-6.511640e-04-6.893139e-04-0.00069505080.089450244BacteriaActinobacteriotaActinobacteria Actinomycetales Dermabacteraceae Dermabacter Dermabacter_hominis(RS_GCF_001570785.1)
4004d5512974bb5e74d555bdc9cf33c8d63-8.348895e-04-9.004046e-04-0.00087052960.058151245BacteriaFirmicutes_C Negativicutes Veillonellales Dialisteraceae Dialister Dialister_micraerophilus(RS_GCF_000183445.1)
86107063e8453bf94891c4d7a17b00af07-7.478750e-04-8.853785e-04-0.00088012040.131385216BacteriaFirmicutes_C Negativicutes Veillonellales Veillonellaceae Veillonella_A Veillonella_A_seminalis(RS_GCF_000315505.1)
3808b21f8171b35ef001832ba1df9b8fbc-1.779978e-03-7.059238e-04-0.00121509670.188807717BacteriaFirmicutes_C Negativicutes Veillonellales Veillonellaceae Veillonella Veillonella_dispar(RS_GCF_000160015.1)