Akhter, Muhammad Pervez and Jiangbin, Zheng and Naqvi, Irfan Raza and Abdelmajeed, Mohammed and Mehmood, Atif and Sadiq, Muhammad Tariq (2020) Document-Level Text Classification Using Single-Layer Multisize Filters Convolutional Neural Network. IEEE Access, 8. pp. 42689-42707. DOI https://doi.org/10.1109/access.2020.2976744
Abstract
The rapid growth of electronic documents is causing problems such as unstructured data, which makes searching for a relevant document more time-consuming and laborious. Text Document Classification (TDC) has great significance in information processing and retrieval, where unstructured documents are organized into pre-defined classes. Urdu is among the most favored research languages in South Asia because of its complex morphology, unique features, and lack of linguistic resources such as standard datasets. Compared to short-text tasks such as sentiment analysis, long text classification needs more time and effort because of its large vocabulary, greater noise, and redundant information. Machine Learning (ML) and Deep Learning (DL) models have been widely used in text processing. Despite the major limitations of ML models, such as learning directed features, they remain the favored methods for Urdu TDC. To the best of our knowledge, this is the first study of Urdu TDC using a DL model. In this paper, we design a large multi-purpose and multi-format dataset that contains more than ten thousand documents organized into six classes. We use a Single-layer Multisize Filters Convolutional Neural Network (SMFCNN) for classification and compare its performance with sixteen ML baseline models on three imbalanced datasets of various sizes. Further, we analyze the effects of preprocessing methods on SMFCNN performance. SMFCNN outperformed the baseline classifiers and achieved accuracy scores of 95.4%, 91.8%, and 93.3% on the medium, large, and small datasets, respectively. The designed dataset will be publicly and freely available in different formats for future research in Urdu text processing.
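The abstract names the architecture but does not describe its internals. As a rough illustration of what "single-layer, multisize filters" usually means in text CNNs, the following NumPy sketch applies one convolutional layer with several filter widths in parallel, max-pools each feature map over time, and feeds the concatenated result to a softmax over six classes. The filter widths (3, 4, 5), the number of filters, the embedding size, the random weights, and all function names are assumptions made for illustration; they are not taken from the paper and this is not the authors' implementation.

```python
import numpy as np

def multisize_conv_features(embedded, filters):
    """Apply one convolutional layer with several filter widths in parallel.

    embedded: array of shape (seq_len, emb_dim), one row per token.
    filters:  dict mapping width -> array of shape (n_filters, width, emb_dim).
    Returns the concatenated max-over-time pooled features.
    """
    seq_len, _ = embedded.shape
    pooled = []
    for width, bank in sorted(filters.items()):
        # All contiguous windows of `width` tokens: (n_windows, width, emb_dim)
        windows = np.stack(
            [embedded[i:i + width] for i in range(seq_len - width + 1)]
        )
        # Dot each window with each filter: (n_windows, n_filters)
        feature_map = np.einsum("nwd,fwd->nf", windows, bank)
        feature_map = np.maximum(feature_map, 0.0)  # ReLU activation
        # Max-over-time pooling keeps one value per filter
        pooled.append(feature_map.max(axis=0))
    return np.concatenate(pooled)

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Hypothetical sizes; only the six classes come from the abstract.
rng = np.random.default_rng(0)
seq_len, emb_dim, n_filters, n_classes = 50, 16, 8, 6

doc = rng.normal(size=(seq_len, emb_dim))  # stand-in for an embedded document
filters = {w: 0.1 * rng.normal(size=(n_filters, w, emb_dim)) for w in (3, 4, 5)}

features = multisize_conv_features(doc, filters)
W_out = 0.1 * rng.normal(size=(n_classes, features.size))
probs = softmax(W_out @ features)  # untrained, so the values are arbitrary
```

Because the weights are random and untrained, `probs` carries no meaning here; the sketch only shows how the tensor shapes flow through the layer.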
Item Type: | Article |
---|---|
Uncontrolled Keywords: | Convolutional neural network; deep learning; machine learning; natural language processing; text document classification; Urdu text classification |
Divisions: | Faculty of Science and Health; Faculty of Science and Health > Computer Science and Electronic Engineering, School of |
SWORD Depositor: | Unnamed user with email elements@essex.ac.uk |
Depositing User: | Unnamed user with email elements@essex.ac.uk |
Date Deposited: | 01 Aug 2024 10:49 |
Last Modified: | 30 Oct 2024 21:37 |
URI: | http://repository.essex.ac.uk/id/eprint/38040 |
Available files
Filename: Document-Level_Text_Classification_Using_Single-Layer_Multisize_Filters_Convolutional_Neural_Network.pdf
Licence: Creative Commons: Attribution 4.0