Althobaiti, Maha (2016) Minimally-supervised Methods for Arabic Named Entity Recognition. PhD thesis, University of Essex.
Althobaiti, Maha (2016) Minimally-supervised Methods for Arabic Named Entity Recognition. PhD thesis, University of Essex.
Althobaiti, Maha (2016) Minimally-supervised Methods for Arabic Named Entity Recognition. PhD thesis, University of Essex.
Abstract
Named Entity Recognition (NER) has attracted much attention over the past twenty years, as a main task of Information Extraction. The current dominant techniques for addressing NER are supervised methods that can achieve high performance, but require new manually annotated data for every new domain and/or genre change. Our work focuses on approaches that make it possible to tackle new domains with minimal human intervention to identify Named Entities (NEs) in Arabic text. Specifically, we investigate two minimally-supervised methods: semi-supervised learning and distant learning. Our semi-supervised algorithm for identifying NEs does not require annotated training data or gazetteers. It only requires, for each NE type, a seed list of a few instances to initiate the learning process. Novel aspects of our algorithm include (i) a new way to produce and generalise the extraction patterns (ii) a new filtering criterion to remove noisy patterns (iii) a comparison of two ranking measures for determining the most reliable candidate NEs. Next, we present our methodology to exploit Wikipedia structure to automatically develop an Arabic NE annotated corpus. A novel mechanism is introduced, based on the high coverage of Wikipedia, in order to address two challenges particular to tagging NEs in Arabic text: rich morphology and the absence of capitalisation. Neither technique has yet achieved performance levels comparable to those of supervised methods. Semi-supervised algorithms tend to have high precision but comparatively low recall, whereas distant learning tends to achieve higher recall but lower precision. Therefore, we present a novel approach to Arabic NER using a combination of semi-supervised and distant learning techniques. We used a variety of classifier combination schemes, including the Bayesian Classifier Combination (BCC) procedure, recently proposed for sentiment analysis. According to our results, the BCC model leads to an increase in performance of 8 percentage points over the best minimally-supervised classifier.
Item Type: | Thesis (PhD) |
---|---|
Subjects: | Q Science > QA Mathematics > QA75 Electronic computers. Computer science |
Divisions: | Faculty of Science and Health > Computer Science and Electronic Engineering, School of |
Depositing User: | Maha Althobaiti |
Date Deposited: | 02 Mar 2016 12:38 |
Last Modified: | 02 Mar 2016 12:38 |
URI: | http://repository.essex.ac.uk/id/eprint/16144 |
Available files
Filename: Maha_Jarallah_Althobaiti_PhD_Thesis_2016.pdf