The DiaBiz corpus is a dialog corpus comprising recordings and annotated transcriptions of phone-based customer-agent interactions in several key business domains.
A general overview of the corpus can be found in this paper:
Also see the accompanying poster here:
The data was automatically transcribed and time-aligned, and subsequently manually corrected and annotated.
Customer support interactions recorded by call center operators are highly unlikely to be widely released in any useful form, as they contain sensitive information that is subject to strict privacy regulations. As a result, NLP start-ups and academic research groups have to develop their own datasets or rely on limited resources which cannot be directly adapted to commercially viable domains. The DiaBiz corpus can serve as a source of training and evaluation data for a wide range of intrinsic and downstream tasks, such as:
The DiaBiz corpus is therefore a major new resource for spoken Polish, offering research potential and making it possible to bootstrap the development of language processing tools for automating linguistic interactions with high volumes of customers, such as voice bots and other dialog systems.
Click HERE to download sample recordings.
The current version of the recording catalog is available HERE.
For more information, please contact firstname.lastname@example.org.
DiaBiz was developed in the project titled “CLARIN - Common Language Resources and Technology Infrastructure”, which is financed under the 2014-2020 Smart Growth Operational Programme, POIR.04.02.00-00C002/19. We would also like to acknowledge the support of three companies: VoiceLab, Genesys and Damovo in the data collection and transcription efforts.