Venugopal, Gayatri and Pramod, Dhanya and Shekhar, Ravi (2022) CWID-hi: A Dataset for Complex Word Identification in Hindi Text. In: Thirteenth Language Resources and Evaluation Conference, 2022-06-20 - 2022-06-25, Marseille, France.
Abstract
Text simplification is a method for improving the accessibility of text by converting complex sentences into simple sentences. Multiple studies have created datasets for text simplification. However, most of these datasets focus only on high-resource languages. In this work, we propose a complex word dataset for Hindi, a language largely ignored in the text simplification literature. We recruited annotators with varying levels of Hindi knowledge so that the annotations capture each annotator's language knowledge. Our analysis shows a significant difference between native and non-native annotators' perception of word complexity. We also built an automatic complex word classifier using a soft voting approach based on the predictions from tree-based ensemble classifiers. These models behave differently for annotations made by different categories of users, such as native and non-native speakers. Our dataset and analysis will help simplify Hindi text according to the user's level of language understanding. The dataset is available at https://zenodo.org/record/5229160.
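The soft voting approach mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the label names ("complex", "simple"), and the probability values are all hypothetical and chosen only for illustration. In soft voting, each classifier outputs class probabilities, the probabilities are averaged per class, and the class with the highest mean is predicted.

```python
from typing import Dict, List


def mean_probabilities(probabilities: List[Dict[str, float]]) -> Dict[str, float]:
    """Average the per-class probabilities produced by several classifiers."""
    if not probabilities:
        raise ValueError("at least one classifier output is required")
    labels = probabilities[0].keys()
    n = len(probabilities)
    return {label: sum(p[label] for p in probabilities) / n for label in labels}


def soft_vote(probabilities: List[Dict[str, float]]) -> str:
    """Return the label with the highest mean probability (soft voting)."""
    means = mean_probabilities(probabilities)
    return max(means, key=means.get)


# Hypothetical outputs of three tree-based classifiers for one word.
# These numbers are invented for illustration, not taken from the paper.
example = [
    {"complex": 0.90, "simple": 0.10},
    {"complex": 0.40, "simple": 0.60},
    {"complex": 0.45, "simple": 0.55},
]

prediction = soft_vote(example)
```

In this made-up example two of the three classifiers favour "simple", so a hard majority vote would pick "simple", whereas soft voting picks "complex" because the first classifier is far more confident.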
Item Type: | Conference or Workshop Item (Paper) |
---|---|
Uncontrolled Keywords: | lexical simplification; complex word identification; dataset; classification; Hindi |
Divisions: | Faculty of Science and Health; Faculty of Science and Health > Computer Science and Electronic Engineering, School of |
SWORD Depositor: | Unnamed user with email elements@essex.ac.uk |
Depositing User: | Unnamed user with email elements@essex.ac.uk |
Date Deposited: | 27 Jan 2025 19:53 |
Last Modified: | 27 Jan 2025 19:54 |
URI: | http://repository.essex.ac.uk/id/eprint/35790 |
Available files
Filename: 2022.lrec-1.604.pdf
Licence: Creative Commons: Attribution-Noncommercial 4.0