Research Repository

Entity finding in a document collection using adaptive window sizes

Alarfaj, Fawaz (2016) Entity finding in a document collection using adaptive window sizes. PhD thesis, University of Essex.

[img]
Preview
Text
thesis.pdf

Download (6MB) | Preview

Abstract

Traditional search engines work by returning a list of documents in response to queries. However, such engines are often inadequate when the information need of the user involves entities. This issue has led to the development of entity-search, which unlike normal web search does not aim at returning documents but names of people, products, organisations, etc. Some of the most successful methods for identifying relevant entities were built around the idea of a proximity search. In this thesis, we present an adaptive, well-founded, general-purpose entity finding model. In contrast to the work of other researchers, where the size of the targeted part of the document (i.e., the window size) is fixed across the collection, our method uses a number of document features to calculate an adaptive window size for each document in the collection. We construct a new entity finding test collection called the ESSEX test collection for use in evaluating our method. This collection represents a university setting as the data was collected from the publicly accessible webpages of the University of Essex. We test our method on five different datasets including the W3C Dataset, CERC Dataset, UvT/TU Datasets, ESSEX dataset and the ClueWeb09 entity finding collection. Our method provides a considerable improvement over various baseline models on all of these datasets. We also find that the document features considered for the calculation of the window size have differing impacts on the performance of the search. These impacts depend on the structure of the documents and the document language. As users may have a variety of search requirements, we show that our method is adaptable to different applications, environments, types of named entities and document collections.

Item Type: Thesis (PhD)
Subjects: Q Science > Q Science (General)
Q Science > QA Mathematics > QA75 Electronic computers. Computer science
Divisions: Faculty of Science and Health > Computer Science and Electronic Engineering, School of
Depositing User: Fawaz Alarfaj
Date Deposited: 08 Mar 2016 12:05
Last Modified: 08 Mar 2016 12:05
URI: http://repository.essex.ac.uk/id/eprint/16192

Actions (login required)

View Item View Item