
Textual Data Augmentation for Efficient Active Learning on Tiny Datasets

Quteineh, Husam and Samothrakis, Spyridon and Sutcliffe, Richard (2020) Textual Data Augmentation for Efficient Active Learning on Tiny Datasets. In: Empirical Methods in Natural Language Processing, 2020-11-16 - 2020-11-20, Online.

2020.emnlp-main.600.pdf - Published Version
Available under a Creative Commons Attribution license.


Abstract

In this paper we propose a novel data augmentation approach in which the guided outputs of a language generation model such as GPT-2, once labeled, can improve the performance of text classifiers through an active learning process. We transform the data generation task into an optimization problem that maximizes the usefulness of the generated output, using Monte Carlo Tree Search (MCTS) as the optimization strategy and incorporating entropy as one of the optimization criteria. We test our approach against a Non-Guided Data Generation (NGDG) process that does not optimize for a reward function. Starting with a small set of data, our results show a performance increase with MCTS of 26% on the TREC-6 Questions dataset and 10% on the Stanford Sentiment Treebank SST-2 dataset. Compared with NGDG, we are able to achieve increases of 3% and 5% on TREC-6 and SST-2.
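The abstract describes guiding GPT-2's generation with MCTS, where entropy is one criterion in the reward that scores candidate outputs before they are labeled in the active learning loop. As a rough, hedged illustration of the entropy component only (not the authors' implementation, which searches over token choices inside MCTS), the sketch below ranks hypothetical generated candidates by the current classifier's prediction entropy so that the most uncertain outputs would be sent for labeling first; the function names and inputs are assumptions introduced for this example.

```python
import numpy as np

def prediction_entropy(probs: np.ndarray) -> np.ndarray:
    """Shannon entropy of each row of class probabilities.
    Higher entropy = the classifier is more uncertain, so the
    candidate is assumed to be more informative to label."""
    eps = 1e-12  # avoid log(0)
    return -np.sum(probs * np.log(probs + eps), axis=1)

def rank_candidates_by_entropy(candidates, classifier_probs):
    """Order generated candidate texts by classifier uncertainty.

    `classifier_probs[i]` is the class-probability vector a text
    classifier assigns to `candidates[i]` (a hypothetical interface;
    in the paper this signal is one part of the MCTS reward)."""
    scores = prediction_entropy(np.asarray(classifier_probs, dtype=float))
    order = np.argsort(-scores)  # most uncertain first
    return [(candidates[i], float(scores[i])) for i in order]

if __name__ == "__main__":
    # Toy example: three generated sentences and made-up class probabilities.
    candidates = [
        "what is the capital of france ?",
        "who wrote the iliad ?",
        "how many moons does mars have ?",
    ]
    probs = [
        [0.50, 0.50],  # maximally uncertain -> highest entropy reward
        [0.90, 0.10],
        [0.70, 0.30],
    ]
    for text, score in rank_candidates_by_entropy(candidates, probs):
        print(f"{score:.3f}  {text}")
```

In the paper's setting this uncertainty score would be combined with other criteria inside the MCTS reward rather than used alone.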

Item Type: Conference or Workshop Item (Paper)
Additional Information: Published proceedings: EMNLP 2020 - 2020 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference
Divisions: Faculty of Science and Health
Faculty of Science and Health > Computer Science and Electronic Engineering, School of
SWORD Depositor: Elements
Depositing User: Elements
Date Deposited: 12 Nov 2020 09:45
Last Modified: 15 Jan 2022 01:35
URI: http://repository.essex.ac.uk/id/eprint/29084
