Small Data in NLU: Proposals towards a Data-Centric Approach

Abstract

Domain-specific voice assistants often suffer from the problem of data scarcity. Publicly available, annotated datasets are in short supply and rarely fit the domain and the language required by a specific use case. Insufficient attention to data quality can generally be problematic when it comes to training and evaluation. The Computational Linguistics (CL) community has gained expertise and developed best practices for high-quality data annotation and collection as well as for qualitative data analysis. However, the recent model-centric focus in AI and ML has not created ideal conditions for a fruitful collaboration with CL and the more data-centric fields of NLP to tackle data quality issues. We showcase principles and methods from CL/NLP research that can potentially guide the development of data-centric NLU for domain-specific voice assistants, but that have typically been overlooked by common practices in ML/AI. Those principles can potentially also help to shape data-centric practices for other domains. We argue that paying more attention to data quality and domain specificity can go a long way in improving the NLU components of today’s voice assistants.

Publication
Peer-reviewed paper, presented as a lightning talk at the NeurIPS Data-centric AI Workshop
Alessandra Zarcone
Professor of Language Technologies and Cognitive Assistants

Computational linguist with a background in NLP and in psycholinguistics, working on AI, NLP and human-machine interaction.