
The Influence of Text Pre-processing on Plagiarism Detection

Ceska, Z and Fox, C (2011) 'The Influence of Text Pre-processing on Plagiarism Detection.' In: Angelova, G and Bontcheva, K and Mitkov, R and Nicolov, N and Nikolov, N, (eds.) International Conference on Recent Advances in Natural Language Processing 2009. Association for Computational Linguistics, 55 - 59.




This paper explores the influence of text pre-processing techniques on plagiarism detection. We examine stop-word removal, lemmatization, number replacement, synonymy recognition, and word generalization. We also look into the influence of punctuation and word order within N-grams. All these techniques are evaluated according to their impact on F1-measure and speed of execution. Our experiments were performed on a Czech corpus of plagiarized documents about politics. At the end of this paper, we propose what we consider to be the best combination of text pre-processing techniques.
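As a rough illustration of the kind of pipeline the abstract describes, the following Python sketch applies two of the listed techniques (stop-word removal and number replacement), builds word N-grams with optional disregard of word order, and computes an F1-measure. It is not the authors' implementation: the stop-word list, the `NUM` placeholder, the Jaccard similarity, and all function names are assumptions made for this example, and lemmatization, synonymy recognition, and word generalization are omitted because they require language-specific resources.

```python
import re

# Hypothetical English stop-word list, for illustration only;
# the paper works on Czech text and its list is not given here.
STOP_WORDS = {"the", "a", "of", "and", "in"}


def preprocess(text, remove_stop_words=True, replace_numbers=True):
    """Lowercase the text, drop punctuation, and apply optional steps."""
    # \w+ keeps runs of word characters, so punctuation is discarded.
    tokens = re.findall(r"\w+", text.lower())
    if replace_numbers:
        # Assumed placeholder: every all-digit token becomes "NUM".
        tokens = ["NUM" if t.isdigit() else t for t in tokens]
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens


def ngrams(tokens, n=3, ignore_order=False):
    """Return the set of word N-grams; optionally sort words inside each."""
    grams = set()
    for i in range(len(tokens) - n + 1):
        gram = tokens[i:i + n]
        if ignore_order:
            gram = sorted(gram)
        grams.add(tuple(gram))
    return grams


def jaccard(a, b):
    """Set-overlap similarity, used here as a stand-in similarity measure."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def f1_measure(predicted, actual):
    """Harmonic mean of precision and recall over two sets of detections."""
    true_positives = len(predicted & actual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(actual)
    return 2 * precision * recall / (precision + recall)
```

For example, `jaccard(ngrams(preprocess(doc_a)), ngrams(preprocess(doc_b)))` gives a similarity score for two documents, and toggling the keyword arguments shows how each pre-processing step changes that score.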

Item Type: Book Section
Uncontrolled Keywords: Plagiarism; Copy Detection; Natural Language Processing; Stop-words; Lemmatization; Synonymy; WordNet; Thesaurus
Subjects: P Language and Literature > P Philology. Linguistics
Q Science > QA Mathematics > QA75 Electronic computers. Computer science
Divisions: Faculty of Science and Health > Computer Science and Electronic Engineering, School of
Date Deposited: 18 Oct 2012 22:44
Last Modified: 17 Aug 2017 18:07
