DiaBiz

DiaBiz is a dialog corpus of spoken Polish comprising recordings and annotated transcriptions of phone-based customer-agent interactions in eight business domains.

The corpus comprises:

  • 3,766 conversations amounting to 385 hours and over 3 million words
  • dialogues between 5 professional call-center agents and 189 participants acting as customers
  • data from 8 business domains with high commercial demand for conversational analytics and automation solutions
  • dialogues based on 200 real-life interaction scenarios

The domains covered:

Domain              Dialogs      Words  Duration (HH:MM:SS)
Banking                 907    773,858             92:56:54
Car rental              246    189,741             24:07:07
Debt collection         300    245,031             29:23:56
Energy services         390    248,295             30:05:42
Insurance               401    307,760             40:00:54
Medical care            371    236,057             30:13:57
Telecommunications      700    416,333             52:21:52
Tourism                 451    674,066             86:23:10
Total                 3,766  3,091,141            385:33:32
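
The per-domain figures can be cross-checked against the Total row and turned into simple per-dialog averages. The following is a minimal sketch in Python, not part of the DiaBiz distribution: the numbers are copied from the table above, while the names DOMAINS, to_seconds, to_hms, totals and per_dialog_means are hypothetical helpers introduced only for illustration.

```python
# Per-domain figures copied from the table: (dialogs, words, duration as HH:MM:SS).
DOMAINS = {
    "Banking": (907, 773_858, "92:56:54"),
    "Car rental": (246, 189_741, "24:07:07"),
    "Debt collection": (300, 245_031, "29:23:56"),
    "Energy services": (390, 248_295, "30:05:42"),
    "Insurance": (401, 307_760, "40:00:54"),
    "Medical care": (371, 236_057, "30:13:57"),
    "Telecommunications": (700, 416_333, "52:21:52"),
    "Tourism": (451, 674_066, "86:23:10"),
}

# The Total row as reported in the table.
REPORTED_TOTAL = (3_766, 3_091_141, "385:33:32")


def to_seconds(hms: str) -> int:
    """Convert an 'H:MM:SS' string (hours may exceed 24) to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hms(total_seconds: int) -> str:
    """Convert seconds back to 'H:MM:SS' without wrapping hours at 24."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


def totals(domains: dict) -> tuple:
    """Sum dialogs, words and durations over all domains."""
    dialogs = sum(d for d, _, _ in domains.values())
    words = sum(w for _, w, _ in domains.values())
    seconds = sum(to_seconds(t) for _, _, t in domains.values())
    return dialogs, words, to_hms(seconds)


def per_dialog_means(domains: dict) -> dict:
    """Return {domain: (mean words per dialog, mean minutes per dialog)}."""
    return {
        name: (words / dialogs, to_seconds(duration) / 60 / dialogs)
        for name, (dialogs, words, duration) in domains.items()
    }


if __name__ == "__main__":
    # The column sums should reproduce the Total row of the table.
    assert totals(DOMAINS) == REPORTED_TOTAL
    for name, (words, minutes) in per_dialog_means(DOMAINS).items():
        print(f"{name:<20}{words:8.1f} words/dialog {minutes:6.2f} min/dialog")
```

Durations are summed in seconds rather than as clock times because the hour field exceeds 24 in most rows.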

The data was manually transcribed, time-aligned and annotated.

Applications

Call-center operators are highly unlikely to release recorded customer support interactions in any useful form, because the recordings contain sensitive information subject to strict privacy regulations. NLP start-ups and academic research groups therefore have to develop their own datasets or rely on limited resources that cannot be directly adapted to commercially viable domains. The DiaBiz corpus can serve as a source of training and evaluation data for a wide range of intrinsic and downstream tasks, such as:

  • speech recognition and transcript formatting
  • speaker diarization
  • conversational intent and named entity recognition
  • spoken dialog segmentation, labelling and classification
  • conversational analytics and more sophisticated modelling of dialog systems

The DiaBiz corpus is therefore a major new resource for spoken Polish, offering research potential and making it possible to bootstrap the development of language processing tools for automating linguistic interactions with high volumes of customers, such as voice bots and other dialog systems.

Availability

Click HERE to download sample recordings.

The current version of the recording catalog is available HERE.

For more information, please contact piotr.pezik@uni.lodz.pl.

Project Team

  • Piotr Pęzik
  • Michał Adamczyk
  • Małgorzata Krawentek
  • Paweł Wilk
  • Sylwia Karasińska
  • Karolina Adamczyk
  • Monika Garnys
  • Karolina Walkusz
  • Angelika Peljak-Łapińska
  • Anna Cichosz
  • Mikołaj Deckert
  • Paulina Rybińska
  • Izabela Grabarczyk
  • Maciej Grabski
  • Karol Ługowski
  • Gracjan Stepaniec-Krawentek
  • Krzysztof Hejduk
  • Michał Koźmiński
  • Zuzanna Deckert
  • Piotr Górniak

Acknowledgments

CLARIN-BIZ

DiaBiz was developed in the project titled “CLARIN - Common Language Resources and Technology Infrastructure”, which is financed under the 2014-2020 Smart Growth Operational Programme, POIR.04.02.00-00C002/19. We would also like to acknowledge the support of three companies: VoiceLab, Genesys and Damovo in the data collection and transcription efforts.

diabiz.1642839366.txt.gz · Last modified: 2022/01/22 09:16 by pezik