Research Repository

Estimating transcription factor binding properties in human cell lines using statistical methods and genomics datasets

Pop, Romana T (2021) Estimating transcription factor binding properties in human cell lines using statistical methods and genomics datasets. Masters thesis, University of Essex.

Restricted to Repository staff only until 12 January 2024.


Site-specific transcription factors recognise and bind DNA motifs to regulate gene expression, so it is important to understand how and where they interact with the genome. Besides DNA sequence, chromatin accessibility, CpG methylation and cooperative binding, either with other transcription factors or with themselves, also affect transcription factor binding. The era of high-throughput sequencing has brought large amounts of genomic data, including chromatin immunoprecipitation followed by sequencing (ChIP-seq) data for transcription factor binding. As a result, bioinformatics and machine learning tools have become popular for genomic data analysis. When investigating transcription factor activity, it is not enough to understand what the factors do; the mechanisms behind their function must also be understood. Explainable bioinformatics models facilitate the unravelling of these mechanistic processes. ChIPanalyser is an R/Bioconductor package that implements a statistical thermodynamics model of transcription factor binding, leveraging binding motifs, chromatin accessibility and transcription factor concentration. This study aimed to apply ChIPanalyser to 135 human transcription factors in the K562 cell line and investigate their chromatin accessibility preferences. Quantile density accessibility was used to determine how transcription factor binding changed when considering different levels of chromatin accessibility. In total, 12 quantiles were used, and the goodness of fit for each was measured by the area under the curve (AUC). The transcription factors were clustered into four groups based on their AUC trends over all quantiles using two algorithms: k-means and a bespoke algorithm.
The four clusters were (i) “pioneer”, containing factors that were indifferent to variations in accessibility, (ii) “partial pioneer”, containing factors with a slight preference for open chromatin, (iii) “traditional”, containing factors with a strong preference for open chromatin, and (iv) “poorly predicted”, containing factors poorly predicted by the model regardless of accessibility. The two methods differed somewhat in their classifications, with the “pioneer” and “partial pioneer” groups being larger under k-means. This study provided insight into the relationship between the chromatin accessibility preferences of transcription factors and their function, and opened the possibility for further study.
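
The clustering step described in the abstract (grouping factors by their AUC profile across 12 quantiles into four clusters with k-means) can be illustrated with a minimal Python sketch. This is not the thesis code and does not use ChIPanalyser, which is an R package. The AUC values below are synthetic, the four profile shapes and the direction of the AUC trend are assumptions chosen only for illustration, and the deterministic farthest-point seeding is a simplification that stands in for whatever initialisation the study used. The names `synthetic_profiles`, `init_centroids` and `kmeans` are hypothetical.

```python
import random

N_QUANTILES = 12  # the abstract reports 12 accessibility quantiles
K = 4             # and four clusters


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def init_centroids(rows, k):
    """Deterministic farthest-point seeding (an assumption of this sketch)."""
    centroids = [list(rows[0])]
    while len(centroids) < k:
        # Pick the profile that is farthest from its nearest chosen centroid.
        farthest = max(rows, key=lambda r: min(sq_dist(r, c) for c in centroids))
        centroids.append(list(farthest))
    return centroids


def kmeans(rows, k, max_iter=100):
    """Plain Lloyd's k-means; returns (labels, centroids)."""
    centroids = init_centroids(rows, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each profile joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(r, centroids[c])) for r in rows
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the run has converged
        labels = new_labels
        # Update step: each centroid becomes the mean profile of its members.
        for c in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def synthetic_profiles(per_group=5, seed=0):
    """Made-up AUC-by-quantile profiles; NOT data from the thesis."""
    rng = random.Random(seed)
    t = [i / (N_QUANTILES - 1) for i in range(N_QUANTILES)]
    shapes = {
        "flat_high": [0.85 for _ in t],                # AUC unaffected by quantile
        "slight_trend": [0.85 - 0.10 * x for x in t],  # mild change across quantiles
        "strong_trend": [0.90 - 0.35 * x for x in t],  # large change across quantiles
        "flat_low": [0.52 for _ in t],                 # near-random fit throughout
    }
    names, rows = [], []
    for name, shape in shapes.items():
        for _ in range(per_group):
            names.append(name)
            rows.append([v + rng.uniform(-0.01, 0.01) for v in shape])
    return names, rows


names, rows = synthetic_profiles()
labels, centroids = kmeans(rows, K)
for name, label in zip(names, labels):
    print(f"{name:>12} -> cluster {label}")
```

On well-separated synthetic profiles like these, each shape should end up in its own cluster. Real AUC profiles are likely to be noisier and to overlap, which may be one reason the abstract reports that k-means and the bespoke algorithm assigned some factors differently.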

Item Type: Thesis (Masters)
Divisions: Faculty of Science and Health > Life Sciences, School of
Depositing User: Romana-Tabita Pop
Date Deposited: 12 Jan 2021 12:38
Last Modified: 12 Jan 2021 12:38
