A Hidden Markov Random Field based Bayesian Approach for the Detection of Chromatin Interactions in Hi-C Data

Osuntoki, Godwin Itunu (2022) A Hidden Markov Random Field based Bayesian Approach for the Detection of Chromatin Interactions in Hi-C Data. PhD thesis, University of Essex.

Osuntoki_Itunu_Thesis.pdf
Restricted to Repository staff only until 25 January 2027.

Abstract

Background: This thesis focuses on the statistical analysis of chromatin interactions using Hi-C data produced by Next Generation Sequencing (NGS). Hi-C is a chromosome conformation capture (3C-based) technique for analysing the spatial organization of chromatin in a cell. Specifically, the method identifies interacting pairs of fragments in the genome. A Hi-C experiment produces a symmetric matrix of counts for pairs of fragments, which are called contact counts, contact frequencies, or frequency counts. The counts are genome-wide, covering every genomic position. Hi-C data are subject to several sources of bias [160]; this research focuses on four of them: distance, GC-content, transposable elements, and accessibility. In recent years, attempts have been made to develop computational techniques capable of modelling these sources of bias or detecting significant interactions specifically in Hi-C data.

Methods: We model these sources of bias as covariates within a regression model, which allows for a better understanding of their effect on chromatin interactions within the genome. Furthermore, we propose the Potts model [157], which introduces spatial dependency by borrowing information from neighbouring loci pairs. The Potts model also allows us to increase the number of components into which the contact counts are classified beyond the two components assumed by existing studies. Finally, we use the deviance information criterion (DIC) to select a preferred distribution for genome-wide analysis.

Results: First, we modelled the sources of bias (genomic distance, GC-content, transposable elements, and accessibility) as covariates in a regression model. Our results show that the genomic distance between interacting loci is the major source of bias and that the effects of the sources of bias depend on the component. Second, we assume that the density of the contact frequency follows first a Zero-Inflated Poisson (ZIP) distribution and then a Negative Binomial (NB) distribution. For the unobserved information (the latent variable), we assume the Potts model. Based on our results, including the calculation of the DIC, the ZIP distribution outperforms the NB distribution. Third, a comparison using the DIC between a model with two components and a model with three components revealed that increasing the number of components improves the detection of significant information in the Hi-C data. Fourth, the genome-wide analysis of Drosophila melanogaster reveals that the majority of significant interactions are found within inter-TADs, that is, outside TADs of the same anchor, and that the majority of significant interactions are long-range interactions.

Conclusion: Our results provide clear evidence that the genome of Drosophila melanogaster can be classified into more than the two components, noise and signal interactions, assumed previously, and that the effects of the sources of bias depend on the component.
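The three modelling ingredients named in the abstract (a ZIP density for the contact counts, a Potts prior on the latent component labels, and the DIC for model comparison) can be illustrated with a minimal Python sketch. This is not the thesis implementation: the function names, the parameter names (`lam`, `pi`, `gamma`), and the use of a single interaction parameter in the Potts term are assumptions made here for illustration, and the DIC uses the standard definition (posterior mean deviance plus effective number of parameters).

```python
import math


def zip_log_pmf(y, lam, pi):
    """Log-probability of count y under a zero-inflated Poisson.

    lam: Poisson mean (assumed > 0); pi: probability of a structural zero.
    Both names are illustrative assumptions, not taken from the thesis.
    """
    if y == 0:
        # A zero arises either structurally or from the Poisson part.
        return math.log(pi + (1.0 - pi) * math.exp(-lam))
    return (math.log(1.0 - pi) - lam + y * math.log(lam)
            - math.lgamma(y + 1))


def potts_log_weight(label, neighbour_labels, gamma):
    """Unnormalised Potts log-prior for one site.

    Counts neighbours sharing the site's label, scaled by the
    interaction parameter gamma (hypothetical name).
    """
    return gamma * sum(1 for z in neighbour_labels if z == label)


def dic(log_lik_draws, log_lik_at_posterior_mean):
    """Deviance information criterion: DIC = D_bar + p_D.

    D = -2 * log-likelihood, D_bar is its posterior mean over MCMC
    draws, and p_D = D_bar - D(theta_bar).
    """
    d_bar = -2.0 * sum(log_lik_draws) / len(log_lik_draws)
    d_hat = -2.0 * log_lik_at_posterior_mean
    p_d = d_bar - d_hat
    return d_bar + p_d
```

Under this convention a lower DIC indicates the preferred model, which is the sense in which two candidate distributions, or two choices for the number of components, would be compared.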

Item Type: Thesis (PhD)
Uncontrolled Keywords: Bayesian, Hi-C, Potts, HMRF, 3C, chromatin, ABC, DIC
Subjects: Q Science > QA Mathematics
Divisions: Faculty of Science and Health > Mathematical Sciences, Department of
Depositing User: Itunu Osuntoki
Date Deposited: 26 Jan 2022 10:19
Last Modified: 26 Jan 2022 10:19
URI: http://repository.essex.ac.uk/id/eprint/32127
