Osuntoki, Godwin Itunu (2022) A Hidden Markov Random Field based Bayesian Approach for the Detection of Chromatin Interactions in Hi-C Data. PhD thesis, University of Essex.
Abstract
Background: This thesis focuses on the statistical analysis of chromatin interactions using Hi-C data produced by Next Generation Sequencing (NGS). Hi-C is a chromosome conformation capture (3C-based) technique that aims to analyze the spatial organization of chromatin in a cell. Specifically, the method identifies interacting pairs of fragments in the genome. The data produced by a Hi-C experiment take the form of a symmetric matrix indexed by pairs of fragments, whose entries are called contact counts, contact frequencies, or frequency counts. The counts are genome-wide, covering every genomic position. Several sources of bias affect Hi-C data [160]; in this research, we focus on four of them (distance, GC-content, transposable elements, and accessibility). In recent years, attempts have been made to develop computational techniques capable of modelling these sources of bias or detecting significant interactions specifically in Hi-C data.

Methods: The present research models these sources of bias as covariates within a regression model, which allows for a better understanding of their effect on chromatin interactions within the genome. Furthermore, we propose the Potts model [157], which allows us to introduce spatial dependency by borrowing information from neighbouring loci pairs. The Potts model also allows us to increase the number of components into which the contact counts can be classified beyond the two components assumed by existing studies. Finally, we use the deviance information criterion (DIC) to select a preferred distribution for genome-wide analysis.

Results: Firstly, we modelled the sources of bias (genomic distance, GC-content, transposable elements, and accessibility) as covariates in a regression model. Our results show that the genomic distance between interacting loci is the major source of bias and that the effects of the sources of bias depend on the component.
Secondly, we assume that the density of the contact frequency follows first a Zero-Inflated Poisson (ZIP) distribution and then a Negative Binomial (NB) distribution. For the unobserved information (the latent variable), we assume the Potts model. Based on our results, including the calculated DIC, the ZIP distribution outperforms the NB distribution. Thirdly, a comparative analysis using the DIC, with the number of components assumed to be two and then three, revealed that increasing the number of components improves the detection of significant information in the Hi-C data. Fourthly, the genome-wide analysis of Drosophila melanogaster reveals that the majority of significant interactions are found within inter-TADs, that is, outside TADs of the same anchor, and that the majority of significant interactions are long-range interactions.

Conclusion: Our results provide clear evidence that the genome of Drosophila melanogaster can be classified into more than the two components of noise and signal interactions, and that, in addition, the effects of the sources of bias depend on the component.
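
As an illustration of the three ingredients named in the abstract, the following is a minimal sketch of a ZIP log-probability, an unnormalised Potts log-prior, and the DIC. It is not the thesis's implementation: the function names, the 4-neighbour grid used for the Potts term, and the parameter names (`lam`, `pi`, `beta`) are assumptions made for this example, and the DIC uses the standard definition (mean deviance plus effective number of parameters).

```python
import math


def zip_logpmf(k, lam, pi):
    """Log-probability of count k under a zero-inflated Poisson.

    lam is the Poisson mean and pi the extra probability of a structural
    zero (both names are illustrative assumptions).
    """
    if k == 0:
        return math.log(pi + (1.0 - pi) * math.exp(-lam))
    return math.log(1.0 - pi) - lam + k * math.log(lam) - math.lgamma(k + 1)


def potts_log_prior(labels, beta):
    """Unnormalised Potts log-prior on a 2D grid of component labels.

    Returns beta times the number of horizontally or vertically adjacent
    cells that share a label. The 4-neighbour structure is an assumption;
    the thesis may define neighbouring loci pairs differently.
    """
    rows, cols = len(labels), len(labels[0])
    agree = 0
    for i in range(rows):
        for j in range(cols):
            if i + 1 < rows and labels[i][j] == labels[i + 1][j]:
                agree += 1
            if j + 1 < cols and labels[i][j] == labels[i][j + 1]:
                agree += 1
    return beta * agree


def dic(deviance_samples, deviance_at_posterior_mean):
    """DIC = mean deviance + p_D, with p_D = mean deviance - D(posterior mean)."""
    mean_dev = sum(deviance_samples) / len(deviance_samples)
    p_d = mean_dev - deviance_at_posterior_mean
    return mean_dev + p_d
```

Under this definition a lower DIC indicates the preferred model, so comparing a ZIP fit against an NB fit amounts to computing `dic` for each from its posterior deviance samples and choosing the smaller value.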
Item Type: | Thesis (PhD) |
---|---|
Uncontrolled Keywords: | Bayesian, Hi-C, Potts, HMRF, 3C, chromatin, ABC, DIC |
Subjects: | Q Science > QA Mathematics |
Divisions: | Faculty of Science and Health > Mathematical Sciences, Department of |
Depositing User: | Itunu Osuntoki |
Date Deposited: | 26 Jan 2022 10:19 |
Last Modified: | 26 Jan 2022 10:19 |
URI: | http://repository.essex.ac.uk/id/eprint/32127 |