Pelicon, Andraž and Shekhar, Ravi and Škrlj, Blaž and Purver, Matthew and Pollak, Senja (2021) Investigating cross-lingual training for offensive language detection. PeerJ Computer Science, 7. e559. DOI https://doi.org/10.7717/peerj-cs.559
Abstract
Platforms that feature user-generated content (social media, online forums, newspaper comment sections, etc.) have to detect and filter offensive speech within large, fast-changing datasets. While many automatic methods have been proposed and achieve good accuracies, most of these focus on the English language, and are hard to apply directly to languages in which few labeled datasets exist. Recent work has therefore investigated the use of cross-lingual transfer learning to solve this problem, training a model in a well-resourced language and transferring to a less-resourced target language; but performance has so far been significantly less impressive. In this paper, we investigate the reasons for this performance drop, via a systematic comparison of pre-trained models and intermediate training regimes on five different languages. We show that using a better pre-trained language model results in a large gain in overall performance and in zero-shot transfer, and that intermediate training on other languages is effective when little target-language data is available. We then use multiple analyses of classifier confidence and language model vocabulary to shed light on exactly where these gains come from and gain insight into the sources of the most typical mistakes.
Item Type: | Article |
---|---|
Uncontrolled Keywords: | Cross-lingual models; Transfer learning; Intermediate training; Offensive language detection; Deep learning |
Divisions: | Faculty of Science and Health; Faculty of Science and Health > Computer Science and Electronic Engineering, School of |
SWORD Depositor: | Unnamed user with email elements@essex.ac.uk |
Depositing User: | Unnamed user with email elements@essex.ac.uk |
Date Deposited: | 24 Sep 2023 09:43 |
Last Modified: | 30 Oct 2024 16:22 |
URI: | http://repository.essex.ac.uk/id/eprint/35253 |
Available files
Filename: Investigating cross-lingual training for offensive language detection.pdf
Licence: Creative Commons: Attribution 4.0