Research Repository

Active Expert Sourcing; Knowledge Extraction from Domain Specific Information

Alghamdi, Ans (2019) Active Expert Sourcing; Knowledge Extraction from Domain Specific Information. PhD thesis, University of Essex.

Active Expert Sourcing.pdf - Accepted Version
Restricted to Repository staff only until 5 September 2024.

Abstract

The development of Named Entity Recognition (NER) in recent years is partially attributed to the availability of annotated data-sets. Data-sets play a crucial part in developing, training, and testing NER algorithms. The need for data-sets becomes more important when adapting the algorithms to new domains. However, domain specific information imposes different challenges on NER, such as the need for annotating a different set of Named Entity (NE) types (i.e. the NE schema) or, more importantly, the need for domain expert annotators. Many domain specific NER systems use academic paper-sharing platforms as sources for data-sets. Either abstracts or the full texts of publications are extracted from the platforms to construct raw data-sets. These raw data-sets are then annotated by domain experts. However, expert annotation is an expensive process and consumes more resources than non-expert annotation. This thesis tackles the problem of adapting NER to new domains and focuses on reducing the resources needed to create domain specific NER. In this thesis, academic paper-sharing portals are used both as a source of raw data and as a source of annotators. In other words, paper-sharing platforms are used as a crowdsourcing platform, and the scholars who share their publications are asked to annotate their own work. This thesis also uses active learning (AL) to further reduce the resources needed to develop NER. In the introduced approach, experts submit their papers online. The papers then go through a Natural Language Processing (NLP) pipeline that prepares the papers’ text for annotation. An active learning algorithm, as part of this pipeline, selects the most informative instances to be annotated. The author is then asked to annotate these instances. The developed NER approach runs in a continuous loop, which is used to produce more annotated resources and to improve the NER model.
Two empirical experiments are conducted: one is a real-world experiment, and the other is a simulation. The real-world experiment tackles the archaeological domain. In this experiment, an NER model is developed for two languages: English and Italian. The second experiment is in the biomedical domain, and an already annotated data-set is used to simulate the approach presented in this thesis. The results of the experiments suggest that the approach used in this thesis is a promising candidate for developing domain specific NER, as it achieved results that are significantly higher than the baseline in terms of the F-score.
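
The selection step described in the abstract, where an active learning algorithm picks the most informative instances for the author to annotate, can be illustrated with a minimal sketch. The abstract does not say which query strategy the thesis uses, so least-confidence uncertainty sampling is assumed here purely for illustration. The function names, the sentence-level granularity, and the toy probabilities are all hypothetical and are not taken from the thesis.

```python
from typing import Callable, List, Sequence


def least_confidence(probs: Sequence[float]) -> float:
    """Uncertainty score: 1 minus the probability of the most likely label."""
    return 1.0 - max(probs)


def select_most_informative(
    sentences: List[str],
    predict_proba: Callable[[str], Sequence[float]],
    k: int,
) -> List[str]:
    """Return the k sentences the current model is least sure about.

    `predict_proba` is a hypothetical stand-in for the current NER model;
    it maps a sentence to a probability distribution over labels.
    """
    scored = [(least_confidence(predict_proba(s)), s) for s in sentences]
    # Highest uncertainty first; the sort is stable, so ties keep input order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [sentence for _, sentence in scored[:k]]


# Toy "model": fixed, made-up distributions for three example sentences.
toy_model = {
    "The amphora was found near Pompeii.": [0.34, 0.33, 0.33],
    "Excavation began in 1998.": [0.90, 0.05, 0.05],
    "Bronze fibula fragments were recovered.": [0.50, 0.30, 0.20],
}

to_annotate = select_most_informative(list(toy_model), toy_model.__getitem__, k=2)
```

In a loop like the one the abstract outlines, the selected sentences would be shown to the paper's author, the resulting annotations added to the training data, and the model retrained before the next selection round.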

Item Type: Thesis (PhD)
Subjects: Q Science > QA Mathematics > QA75 Electronic computers. Computer science
Divisions: Faculty of Science and Health > Computer Science and Electronic Engineering, School of
Depositing User: Ans Alghamdi
Date Deposited: 06 Sep 2019 08:51
Last Modified: 06 Sep 2019 12:09
URI: http://repository.essex.ac.uk/id/eprint/25275
