Akhter, Muhammad Pervez and Jiangbin, Zheng and Naqvi, Irfan Raza and Abdelmajeed, Mohammed and Sadiq, Muhammad Tariq (2020) Automatic Detection of Offensive Language for Urdu and Roman Urdu. IEEE Access, 8. pp. 91213-91226. DOI https://doi.org/10.1109/access.2020.2994950
Abstract
In recent years, unethical behavior in online environments has become increasingly visible. The presence of offensive language on social media platforms, and the automatic detection of such language, is becoming a major challenge in modern society. The complexity of natural language constructs makes this task even more challenging. Until now, most research has focused on resource-rich languages like English. Roman Urdu and Urdu are the two scripts used to write the Urdu language on social media. The Roman script uses English characters, while the Urdu script uses Urdu characters. Urdu and Hindi are similar languages that differ only in their writing scripts, and the Roman-script forms of both languages are similar. This study addresses the detection of offensive language in user comments written in Urdu, a resource-poor language. We propose the first offensive-language dataset for Urdu, containing user-generated comments from social media. We use individual and combined n-gram techniques to extract features at the character level and the word level. We apply seventeen classifiers from seven machine learning techniques to detect offensive language in both Urdu and Roman Urdu text comments. Experiments show that regression-based models using character n-grams perform best on Urdu text. Character-level tri-grams outperform the other word and character n-grams. LogitBoost and SimpleLogistic outperform the other models and achieve F-measure values of 99.2% and 95.9% on the Roman Urdu and Urdu datasets, respectively. Our dataset is publicly available on GitHub for future research.
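As an illustration of the feature extraction the abstract describes, the following is a minimal sketch of character-level and word-level n-gram extraction, including a combined variant that pools several n-gram sizes. It is not the authors' implementation: the function names (`char_ngrams`, `word_ngrams`, `combined_char_features`), the whitespace tokenization, and the sample Roman Urdu string are all assumptions made for this example.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of `text` (hypothetical helper)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text, n=2):
    """Return all overlapping word n-grams, assuming whitespace tokenization."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def combined_char_features(text, sizes=(1, 2, 3)):
    """Count character n-grams of several sizes in one feature bag.

    This mirrors the idea of "combined" n-grams (e.g. uni-, bi-, and
    tri-grams pooled together); the exact combination used in the paper
    is not specified here, so this is an assumption.
    """
    counts = Counter()
    for n in sizes:
        counts.update(char_ngrams(text, n))
    return counts


if __name__ == "__main__":
    sample = "yeh acha hai"  # illustrative Roman Urdu string, not from the dataset
    print(char_ngrams(sample, 3))
    print(word_ngrams(sample, 2))
    print(combined_char_features(sample, sizes=(2, 3)).most_common(5))
```

The resulting counts could then be passed to a classifier as a sparse feature vector. Character n-grams work directly on the raw string, so the same functions apply to both the Roman and Urdu scripts without a script-specific tokenizer.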
Item Type: | Article |
---|---|
Uncontrolled Keywords: | Machine learning; YouTube; Feature extraction; Videos; Writing; Twitter; Social media; offensive language detection; natural language Processing; text processing |
Divisions: | Faculty of Science and Health; Faculty of Science and Health > Computer Science and Electronic Engineering, School of |
SWORD Depositor: | Unnamed user with email elements@essex.ac.uk |
Depositing User: | Unnamed user with email elements@essex.ac.uk |
Date Deposited: | 01 Aug 2024 10:43 |
Last Modified: | 30 Oct 2024 21:37 |
URI: | http://repository.essex.ac.uk/id/eprint/38041 |
Available files
Filename: Automatic_Detection_of_Offensive_Language_for_Urdu_and_Roman_Urdu.pdf
Licence: Creative Commons: Attribution 4.0