Research Repository

Improving Statistical Learning within Functional Genomic Experiments by means of Feature Selection

Mahmoud, Osama (2015) Improving Statistical Learning within Functional Genomic Experiments by means of Feature Selection. PhD thesis, University of Essex.




Statistical learning is concerned with understanding and modelling complex datasets. Given a set of training data, its main aim is to build a model that maps the relationship between a set of input features and a response of interest in a predictive way. Classification is the foremost task of such a learning process. It has applications in many important fields of modern biology, including microarray data and other functional genomic experiments. Microarray technology allows the expression of tens of thousands of genes (features) to be measured simultaneously. However, the expressions of these genes are usually observed in a small number, tens to a few hundred, of tissue samples (observations). This common characteristic of high dimensionality has a great impact on the learning process, since most genes are noisy, redundant or irrelevant to the learning task at hand. Both the prediction accuracy and the interpretability of a constructed model are believed to be enhanced by performing the learning process on selected informative features only. Motivated by this notion, a novel statistical method, named Proportional Overlapping Scores (POS), is proposed for selecting features based on an analysis of the overlap of gene expression data across the different classes of a classification task. This method yields a measure, called the POS score, of a feature’s relevance to the learning task. POS is further extended to minimize the redundancy among the selected features. The proposed approaches are validated on several publicly available gene expression datasets using widely used classifiers to observe the effects on their prediction accuracy. Selection stability is also examined to address the biological knowledge captured in the obtained results.
The experimental results of classification error rates computed using the Random Forest, k-Nearest Neighbor and Support Vector Machine classifiers show that the proposals achieve better performance than widely used gene selection methods.
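
The general idea of ranking genes by how much their expression values overlap across classes can be illustrated with a minimal sketch. This is not the POS score defined in the thesis: it assumes exactly two classes, represents each class by its plain min-max range, and scores a gene by the fraction of samples falling inside the region shared by both ranges. The function names `overlap_score` and `rank_features` and the toy data are hypothetical.

```python
def overlap_score(values, labels):
    """Fraction of samples lying inside the overlap of the two class ranges.

    Assumption: a simple min-max range per class (not the thesis's POS definition).
    A score of 0.0 means the classes are perfectly separated by this feature.
    """
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("this sketch handles exactly two classes")
    first = [v for v, y in zip(values, labels) if y == classes[0]]
    second = [v for v, y in zip(values, labels) if y == classes[1]]
    low = max(min(first), min(second))    # lower edge of the shared region
    high = min(max(first), max(second))   # upper edge of the shared region
    if low > high:                        # ranges do not intersect
        return 0.0
    inside = sum(1 for v in values if low <= v <= high)
    return inside / len(values)


def rank_features(matrix, labels, top_k):
    """Return indices of the top_k features with the smallest overlap score.

    matrix holds one row per feature (gene) and one column per sample.
    """
    scored = sorted((overlap_score(row, labels), idx) for idx, row in enumerate(matrix))
    return [idx for _, idx in scored[:top_k]]


# Toy data: three genes measured on six samples, three samples per class.
labels = [0, 0, 0, 1, 1, 1]
genes = [
    [1, 2, 3, 10, 11, 12],  # class ranges are disjoint
    [1, 5, 9, 2, 6, 10],    # class ranges overlap heavily
    [1, 2, 4, 3, 8, 9],     # class ranges overlap slightly
]
selected = rank_features(genes, labels, top_k=2)  # → [0, 2]
```

Genes whose class ranges barely overlap are ranked first, which mirrors the intuition that such genes are more informative for classification. A min-max range is sensitive to outliers, so a practical score would need a more robust description of each class.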

Item Type: Thesis (PhD)
Subjects: Q Science > QA Mathematics
Divisions: Faculty of Science and Health > Mathematical Sciences, Department of
Depositing User: Jim Jamieson
Date Deposited: 22 Jul 2019 09:38
Last Modified: 22 Jul 2019 09:38
