Huang, Guangming and Long, Yunfei and Luo, Cunjin and Li, Yingya (2023) LIDA: Lexical-based Imbalanced Data Augmentation for Content Moderation. In: The 37th Pacific Asia Conference on Language, Information and Computation (PACLIC 37), 2023-12-02 - 2023-12-05, Hong Kong. (In Press)
Abstract
Data augmentation (DA) has attracted considerable attention as a way to obtain more data without additional human annotation effort, particularly in low-resource, sensitive, and class-imbalanced tasks. However, most current approaches are designed for the general domain, where data is often balanced, whereas in specific tasks such as content moderation the class distribution is often skewed. The situation is further exacerbated by data sensitivity, which makes additional human annotation unlikely or costly to obtain. To fill this research gap, our paper presents a lexical-based imbalanced data augmentation (LIDA) approach for content moderation. LIDA is an easy-to-implement and explainable DA method that randomly inserts terms from sensitive lexicons into negative samples to convert them into positive ones. In this way, LIDA can produce a balanced dataset and avoid the problems caused by a skewed distribution. We validate our model on two datasets, Wiki-TOX and Wiki-ATT, to show the superior performance of our proposed algorithm compared to other rule-based data augmentation baselines, and we report p-values to demonstrate its effectiveness and stability.
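The insertion step described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `insert_terms` and `balance_with_lexicon`, the whitespace tokenization, the `n_insert` parameter, the placeholder lexicon entries, and the choice to keep the original negatives while appending synthetic positives until the two classes are equal in size are all assumptions made for illustration.

```python
import random


def insert_terms(text, lexicon, n_insert, rng):
    """Insert n_insert randomly chosen lexicon terms at random token positions."""
    tokens = text.split()  # assumption: simple whitespace tokenization
    for _ in range(n_insert):
        position = rng.randint(0, len(tokens))  # inclusive, so appending at the end is possible
        tokens.insert(position, rng.choice(lexicon))
    return " ".join(tokens)


def balance_with_lexicon(samples, lexicon, n_insert=1, seed=0):
    """Append synthetic positives built from negatives until both classes match in size.

    samples: list of (text, label) pairs, label 1 = positive, 0 = negative.
    Assumption: original samples are kept and new positives are appended.
    """
    rng = random.Random(seed)
    negatives = [text for text, label in samples if label == 0]
    n_positive = sum(1 for _, label in samples if label == 1)
    needed = max(0, len(negatives) - n_positive)

    augmented = list(samples)
    for _ in range(needed):
        source = rng.choice(negatives)  # negatives may be reused if few are available
        augmented.append((insert_terms(source, lexicon, n_insert, rng), 1))
    return augmented


# Hypothetical toy data; the lexicon entries are placeholders, not a real word list.
toy_samples = [
    ("thanks for fixing the citation", 0),
    ("please see the talk page", 0),
    ("i reverted the last edit", 0),
    ("the article needs more sources", 0),
    ("an example of a flagged comment", 1),
]
toy_lexicon = ["<TERM_A>", "<TERM_B>"]

balanced = balance_with_lexicon(toy_samples, toy_lexicon, n_insert=1, seed=42)
```

With this toy input, the returned list keeps the five original samples and adds three synthetic positives, giving four samples per class. Which terms are inserted, and where, depends on the seed.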
Item Type: | Conference or Workshop Item (Paper) |
---|---|
Additional Information: | Published proceedings: _not provided_ |
Divisions: | Faculty of Science and Health; Faculty of Science and Health > Computer Science and Electronic Engineering, School of |
SWORD Depositor: | Unnamed user with email elements@essex.ac.uk |
Depositing User: | Unnamed user with email elements@essex.ac.uk |
Date Deposited: | 03 Oct 2023 14:53 |
Last Modified: | 05 Jan 2024 22:10 |
URI: | http://repository.essex.ac.uk/id/eprint/36339 |
Available files
Filename: LIDA__Lexical_based_Imbalanced_Data_Augmentation_for_Content_Moderation__PACLIC_2023_.pdf