Modelling phylogeny in 16S rRNA gene sequencing datasets using string-based kernels

Jonathan Ish-Horowicz, Sarah Filippi

公開日: 2022/10/14

Abstract

The bacterial microbiome is increasingly being recognised as a key factor in human health, driven in large part by datasets collected using 16S rRNA (ribosomal ribonucleic acid) gene sequencing, which enable cost-effective quantification of the composition of an individual's bacterial community. One of the defining characteristics of 16S rRNA datasets is the evolutionary relationships that exist between taxa (phylogeny). Here, we demonstrate the utility of modelling these phylogenetic relationships in two statistical tasks (the two sample test and host trait prediction) and propose a novel family of kernels for analysing microbiome datasets by leveraging string kernels from the natural language processing literature. We show via simulation studies that a kernel two-sample test using the proposed kernel is sensitive to the phylogenetic scale of the difference between the two populations. In a second set of simulations we also show how Gaussian process modelling with string kernels can infer the distribution of bacterial-host effects across the phylogenetic tree \new{and apply this approach to a real host-trait prediction task.} The results in the paper can be reproduced by running the code at https://github.com/jonathanishhorowicz/modelling_phylogeny_in_16srrna_using_string_kernels.

Modelling phylogeny in 16S rRNA gene sequencing datasets using string-based kernels | SummarXiv | SummarXiv