Effective Features and Machine Learning Methods for Document Classification

Almulla Khalaf, Maysa I Abdulhussain (2019) Effective Features and Machine Learning Methods for Document Classification. PhD thesis, University of Essex.

Abstract

Document classification is used in a variety of applications, such as phishing and fraud detection, news categorisation, and information retrieval. This thesis aims to provide novel solutions to several important problems in document classification.

First, an improved Principal Components Analysis (PCA), based on similarity and correlation criteria instead of covariance, is proposed. It aims to capture a low-dimensional feature subset that improves performance in text classification. The experimental results demonstrate the advantages of the proposed method for text classification in high-dimensional feature space, in terms of the number of features required to achieve the best classification accuracy.

Second, two hybrid feature-subset selection methods are proposed. They combine (via either union or intersection) the results of supervised filter approaches (in one method) and unsupervised filter approaches (in the other method) before a wrapper is applied. This leads to a low-dimensional feature subset that achieves both high classification accuracy and good interpretability, and requires less processing time than most current methods. The experimental results demonstrate the effectiveness of the proposed methods for feature-subset selection in high-dimensional feature space, in terms of the number of selected features and the processing time needed to achieve the best classification accuracy.

Third, a class-specific (supervised) pre-trained approach based on a sparse autoencoder is proposed for acquiring a low-dimensional interesting structure of relevant features, which can be used for high-performance document classification. The experimental results demonstrate the merit of the proposed method for document classification in high-dimensional feature space, in terms of the limited number of features required to achieve good classification accuracy.

Finally, deep classifier structures associated with a stacked autoencoder (SAE) for higher-level feature extraction are investigated, aiming to overcome the difficulties experienced in training deep neural networks with limited training data in high-dimensional feature space, such as overfitting and vanishing/exploding gradients. This investigation has resulted in a three-stage learning algorithm for training deep neural networks. The experimental results show the advantages and effectiveness of the proposed three-stage learning algorithm in comparison with support vector machines (SVMs) combined with SAE and with a Deep Multilayer Perceptron (DMLP) with random weight initialisation.
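The hybrid feature-subset selection idea summarised in the abstract (combine the outputs of two filters by union or intersection, then refine with a wrapper) can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names, the toy filter scores, the top-k cut-off, and the greedy forward-selection wrapper are all assumptions made for illustration, and the `toy_evaluate` function stands in for what would normally be a cross-validated classifier accuracy.

```python
from typing import Callable, Dict, List, Set


def top_k(scores: Dict[str, float], k: int) -> Set[str]:
    """Return the k highest-scoring feature names from one filter."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])


def combine_filters(a: Set[str], b: Set[str], mode: str = "union") -> Set[str]:
    """Merge two filter outputs by union or intersection."""
    if mode == "union":
        return a | b
    if mode == "intersection":
        return a & b
    raise ValueError("mode must be 'union' or 'intersection'")


def forward_wrapper(candidates: Set[str],
                    evaluate: Callable[[List[str]], float]) -> List[str]:
    """Greedy forward selection (an assumed wrapper strategy).

    Repeatedly adds the candidate that gives the highest score and
    stops as soon as no candidate improves on the best score so far.
    """
    selected: List[str] = []
    best = float("-inf")
    remaining = sorted(candidates)
    while remaining:
        score, feature = max((evaluate(selected + [f]), f) for f in remaining)
        if score <= best:
            break
        best = score
        selected.append(feature)
        remaining.remove(feature)
    return selected


# Hypothetical scores from two different filters over the same features.
filter_a = {"bank": 0.9, "login": 0.8, "the": 0.1, "verify": 0.7}
filter_b = {"bank": 0.9, "login": 0.2, "the": 0.8, "verify": 0.5}


def toy_evaluate(features: List[str]) -> float:
    """Stand-in for classifier accuracy: rewards two useful features,
    penalises every other feature slightly."""
    useful = {"bank": 0.3, "login": 0.2}
    return sum(useful.get(f, -0.05) for f in features)


union_set = combine_filters(top_k(filter_a, 2), top_k(filter_b, 2), "union")
inter_set = combine_filters(top_k(filter_a, 2), top_k(filter_b, 2), "intersection")
chosen = forward_wrapper(union_set, toy_evaluate)
```

In this sketch the union gives the wrapper a larger candidate pool to search, while the intersection gives a smaller pool and therefore fewer wrapper evaluations. That trade-off between subset quality and processing time is the one the abstract refers to.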

Item Metadata

Item Type:	Thesis (PhD)
Uncontrolled Keywords:	Document classification, phishing and fraud detection, data mining, machine learning, deep learning
Subjects:	Q Science > QA Mathematics > QA75 Electronic computers. Computer science
Divisions:	Faculty of Science and Health > Computer Science and Electronic Engineering, School of
Depositing User:	Maysa Almulla Khalaf
Date Deposited:	06 Aug 2019 12:37
Last Modified:	05 Aug 2022 01:00
URI:	http://repository.essex.ac.uk/id/eprint/25105

Available files

Filename: Maysa_Almulla khalaf_PhD thesis 15July_2019.pdf
