Research Repository

Supervised Learning Methods for 16S rRNA based Functional Annotation

Kulakowski, Rafal (2021) Supervised Learning Methods for 16S rRNA based Functional Annotation. PhD thesis, University of Essex.

PhD Thesis Rafal Kulakowski.pdf
Restricted to Repository staff only until 20 October 2024.



Background: The 16S rRNA gene is a marker gene commonly used for taxonomic annotation of bacteria and archaea. Developments in next-generation sequencing technologies have allowed researchers to obtain large volumes of 16S rRNA sequences from environmental samples of microbial communities. These data are predominantly used to obtain the taxonomic composition of microbial populations. However, in recent years attempts have been made to develop computational tools capable of annotating functional or metabolic labels from 16S rRNA data.
Methods: The present research implements supervised learning algorithms to build classification models that take 16S rRNA sequences as input and deliver a predicted set of functional labels as output. Using a large dataset constructed by applying the FAPROTAX tool, sets of computational experiments were performed to evaluate a range of techniques that could be used to construct a robust classification pipeline.
Results: The first set of results demonstrated the validity of the approach and revealed that a supervised-learning-based classification pipeline would benefit from advanced supervised learning algorithms, such as Random Forest and nonlinear Support Vector Machines. Secondly, the experiments that studied the effects of applying alignment-based and alignment-free approaches to the pre-processing of 16S rRNA sequences indicated that the former approach delivers a better comparison space than the latter, and that combining both approaches provides no benefit. The final set of computational experiments, in which classification models were trained for each FAPROTAX functional label, demonstrated problems with detecting low-prevalence functional traits. Applying recent techniques for generating synthetic data points revealed that these problems can, to some extent, be mitigated. The best-performing technique, DBSMOTE, combined with the Random Forest algorithm, made it possible to train reliable classifiers for 48 functional traits.
Conclusion: Our results provide clear evidence of the validity of a supervised-learning-based approach to 16S rRNA based functional annotation, demonstrating that, in addition to being a reliable phylogenetic marker, this gene can also be used to detect a range of metabolic traits.
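
As a rough illustration of what an alignment-free representation of a sequence can look like, the sketch below converts a sequence into a vector of normalised k-mer frequencies, which could then be passed to a classifier. This is a minimal sketch under stated assumptions and not the thesis's actual pre-processing: the abstract does not specify which alignment-free method was used, and the names `kmer_vocabulary` and `kmer_profile`, the choice of k, and the toy sequences are all hypothetical.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"


def kmer_vocabulary(k):
    """Return all 4**k possible k-mers over A/C/G/T in lexicographic order."""
    return ["".join(p) for p in product(ALPHABET, repeat=k)]


def kmer_profile(sequence, k=3):
    """Return normalised k-mer frequencies for one sequence.

    Assumptions (illustrative only): U is treated as T so that RNA and DNA
    notation give the same profile, and windows containing any other
    character (for example N) are skipped.
    """
    seq = sequence.upper().replace("U", "T")
    valid = set(ALPHABET)
    counts = Counter(
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= valid
    )
    total = sum(counts.values())
    # Fixed-length vector: one entry per possible k-mer, zeros if none found.
    return [counts[kmer] / total if total else 0.0 for kmer in kmer_vocabulary(k)]


# Toy fragments for illustration, not real 16S rRNA data.
toy = {"seq_a": "ACGTACGTAC", "seq_b": "ACGUNNACGU"}
features = {name: kmer_profile(s, k=2) for name, s in toy.items()}
```

Every sequence maps to a vector of the same length (4**k entries), regardless of sequence length, which is what lets such profiles be used directly as input features without first aligning the sequences.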

Item Type: Thesis (PhD)
Uncontrolled Keywords: 16S rRNA, Supervised Learning, Biological Sequence Classification, Machine Learning, Data Science, Functional Annotation.
Subjects: Q Science > QA Mathematics
Q Science > QR Microbiology
Divisions: Faculty of Science and Health > Mathematical Sciences, Department of
Depositing User: Rafal Kulakowski
Date Deposited: 21 Oct 2021 16:56
Last Modified: 21 Oct 2021 16:56
