Research Repository

Natural Language Processing methods for short informal text

Alshehabi Al-Ani, Jabir (2020) Natural Language Processing methods for short informal text. PhD thesis, University of Essex.

FinalSubmissionVersion.pdf - Accepted Version
Restricted to Repository staff only until 8 April 2025.


Abstract

The English language is changing faster than at any time before. Social media plays a major role in this change, as it has become an essential part of people's social lives. Thoughts, ideas, feelings, and even special moments make up the main content of posts on Twitter and Facebook, which are the most popular social media platforms. In this work, we addressed the problem of language change and how it affects traditional Natural Language Processing (NLP) techniques in this domain. The domain is considered challenging for many NLP methods, such as topic modelling, named entity recognition, and sentiment analysis. We produced novel NLP methods that target the informality of short text. Our first model addresses topic modelling for short, messy text. The proposed model was inspired by the relation, over time, between a word's frequency and the frequencies of its context words (the words surrounding the selected word). This relation was translated into co-occurrence patterns, transformed into a feature space, and stored as word embeddings. The features were generated from the frequencies of words and context words by our novel Term Frequency-Inverse Context Term Frequency (TF-ICTF) algorithm. TF-ICTF was derived from the standard Term Frequency-Inverse Document Frequency (TF-IDF) algorithm, which did not perform well on short, messy text. The proposed model is based on word probabilities and the co-occurrences between words within the short text. We therefore named the approach Probabilistic Relational Supervised Topic Modelling. The second approach addresses non-standard entities in short text. We proposed a new model that uses word-pattern embeddings generated from streamed Twitter data. These patterns should include entities identified by state-of-the-art named entity recognition (NER) algorithms. We named this approach Probabilistic Named Entity Recognition (PNER). PNER was trained on the entities identified in the pattern embeddings so that it can recognise entities in non-standard formats. Lastly, we proposed the Probabilistic co-occurrence Relational Sentiment (PR_Sentiment) approach to classify tweets by sentiment. It uses sentiment patterns detected in short-text tweets. These patterns are structured as n-grams, which are detected in sentiment-annotated tweets and labelled accordingly. The dataset used is a standard dataset of more than one million annotated tweets. Moreover, the PR_Sentiment model runs in near real time. The aim of our project is to address the informality and non-standardisation of short social media text and to produce novel NLP methods. These methods were designed as a novel approach towards generalising the processing of short, messy text. Our methods have therefore been tested and compared against several state-of-the-art approaches to demonstrate their novelty.
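
For reference, the standard TF-IDF weighting that the abstract names as the starting point for TF-ICTF can be sketched as below. This is only the textbook baseline and not the thesis's TF-ICTF, whose definition is not given on this page. The function name `tf_idf` and the example tweets are illustrative assumptions.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Textbook TF-IDF for a list of tokenised documents.

    Baseline only: this is NOT the TF-ICTF algorithm from the thesis.
    """
    n_docs = len(docs)
    # Document frequency: how many documents contain each term at least once.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        # Weight = relative term frequency * log(N / document frequency).
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical short, informal messages used purely for illustration.
tweets = [
    "gr8 game 2nite".split(),
    "gr8 weather 2day".split(),
    "traffic is bad 2day".split(),
]
scores = tf_idf(tweets)
```

In messages this short, almost every term occurs exactly once, so the term-frequency factor is nearly uniform within a message and the weights are driven almost entirely by the document-frequency factor. That may be one reason the baseline struggles on short, messy text, though the abstract itself does not state the cause.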

Item Type: Thesis (PhD)
Subjects: Q Science > QA Mathematics > QA75 Electronic computers. Computer science
Divisions: Faculty of Science and Health > Computer Science and Electronic Engineering, School of
Depositing User: Jabir Alshehabi Al-Ani
Date Deposited: 09 Apr 2020 13:07
Last Modified: 09 Apr 2020 13:31
URI: http://repository.essex.ac.uk/id/eprint/27288
