Maqsood, Khizra (2025) Explainable Artificial Intelligence for the identification of novel mammalian enhancers and their epigenetic code. Doctoral thesis, University of Essex. DOI https://doi.org/10.5526/ERR-00041350
Abstract
Enhancers are non-coding regions of the genome responsible for controlling the activity of genes. Approximately 80-90% of mutations causing human diseases (including cancers) are located in the non-coding part of the human genome, often in enhancers. To better understand how these mutations lead to disease states, both experimental and machine learning approaches have been developed to annotate enhancers. However, these techniques fail to give genomics and clinical researchers an accurate explanation of why a model predicts specific regions as enhancers over other regions. Hence, there is a need for eXplainable Artificial Intelligence (XAI) systems based on IF/THEN rules that can be easily understood, analysed, and augmented by domain experts. In this thesis, we developed several Artificial Intelligence (AI) models (Convolutional Neural Networks (CNNs), XGBoost, Logistic Regression and a Type-2 Fuzzy logic rule-based XAI model) for enhancer prediction in different human and mouse cell lines. While all models display high accuracy, only our XAI models (AUC 0.79) are explainable and provide a set of IF/THEN rules that decipher the underlying combinatorial epigenetic code of enhancers. Furthermore, our results confirmed that only the XAI and, partially, the CNN perform consistently well in the other two human cell lines, i.e. K562 and IMR90 (AUC: XAI 0.73; CNN 0.7 and 0.54). The other opaque models did not generalise well (logistic regression AUC: 0.67; XGBoost AUC: 0.55), further supporting the generalisation abilities of the XAI model. Our AI models identified many novel enhancers (i.e. H3K18ac and H3K14ac), which display the same epigenetic signatures as experimentally identified ones. Interestingly, seven features (i.e. epigenetic marks) in human (XAI AUC: 0.79) and five features in mouse (XAI AUC: 0.8) are sufficient to annotate enhancers without losing accuracy.
Furthermore, the minimal human model based on seven epigenetic marks was applied to annotate enhancers in 10 brain tumour (glioblastoma) patient-derived lines. Most importantly, the XAI provides insights into the specific combinations of epigenetic modifications that classify enhancers, instead of only reporting the importance of one or several features. In particular, we present an interpretable IF/THEN rule architecture that models how features interact in high-dimensional biological data and addresses some major drawbacks of post-hoc explainability methods such as SHAP and LIME. Unlike ML methods that only output static importance scores or consider features individually, our rule base lays out the combinatorial logic explicitly. It shows how epigenetic markers work together to activate enhancers (e.g. “IF H3K27ac is highly enriched and H3K4me1 is highly enriched in a genomic region, THEN the region is defined as an enhancer region”). The approach also resolves ambiguity around feature importance: whereas univariate analysis might misjudge epigenetic markers because their individual correlations are inconsistent, our rules capture their predictive strength when they occur alongside other markers. This highlights how linear or additive models can miss crucial conditional dependencies. The framework has significant implications for personalised medicine: by mapping patient-specific epigenetic profiles to rule-based logic, clinicians could identify individualised enhancer activation patterns linked to disease. For example, a tumour might show enhancers that are classified by the rule “IF H3K18ac and H3K14ac are highly enriched in the genome”, instead of the usual pairing of H3K27ac and H3K4me1. This points to the possibility of personalised therapies, in which drugs are developed to specifically control the enrichment level of these enhancer markers.
For instance, a personalised drug could be designed to modulate epigenetic modifications, specifically by reducing H3K18ac enrichment from high to low and adjusting H3K14ac enrichment from high to medium. Such targeted regulation would ultimately disrupt key oncogenic pathways, thereby inhibiting tumour growth. This level of explainability is unattainable with methods like SHAP or LIME, as they lack the framework needed to suggest actionable biomarkers that depend on specific conditions.
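The IF/THEN rules quoted in the abstract can be illustrated with a minimal Python sketch. It is a deliberately simplified stand-in, not the thesis's Type-2 fuzzy logic system: enrichment levels are assigned with crisp cut-offs rather than fuzzy membership functions, and the thresholds (`LOW_MAX`, `HIGH_MIN`), the `Rule` class, and the `classify` function are hypothetical names and values chosen only for illustration. The two rules are the ones given as examples in the abstract.

```python
from dataclasses import dataclass

# Hypothetical enrichment cut-offs in arbitrary units (NOT taken from the thesis).
LOW_MAX = 1.0
HIGH_MIN = 3.0


def enrichment_level(signal: float) -> str:
    """Map a raw enrichment signal to a crisp linguistic label."""
    if signal < LOW_MAX:
        return "low"
    if signal >= HIGH_MIN:
        return "high"
    return "medium"


@dataclass
class Rule:
    """IF every listed mark has the required level THEN assign `label`."""
    conditions: dict  # mark name -> required level ("low", "medium", "high")
    label: str

    def matches(self, region: dict) -> bool:
        # A rule fires only when ALL of its conditions hold together.
        return all(
            mark in region and enrichment_level(region[mark]) == level
            for mark, level in self.conditions.items()
        )


# The two example rules quoted in the abstract.
RULES = [
    Rule({"H3K27ac": "high", "H3K4me1": "high"}, "enhancer"),
    Rule({"H3K18ac": "high", "H3K14ac": "high"}, "enhancer"),
]


def classify(region: dict, rules=RULES, default: str = "non-enhancer") -> str:
    """Return the label of the first matching rule, else the default label."""
    for rule in rules:
        if rule.matches(region):
            return rule.label
    return default


if __name__ == "__main__":
    print(classify({"H3K27ac": 5.0, "H3K4me1": 4.2}))
    print(classify({"H3K27ac": 5.0, "H3K4me1": 0.5}))
```

In this sketch a region with high H3K27ac but low H3K4me1 is not labelled an enhancer, which mirrors the abstract's point that a mark's predictive value depends on the other marks it occurs with.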
Item Type: | Thesis (Doctoral) |
---|---|
Subjects: | Q Science > QA Mathematics > QA76 Computer software |
Divisions: | Faculty of Science and Health > Computer Science and Electronic Engineering, School of |
Depositing User: | Khizra Maqsood |
Date Deposited: | 31 Jul 2025 08:45 |
Last Modified: | 31 Jul 2025 08:45 |
URI: | http://repository.essex.ac.uk/id/eprint/41350 |