Automatic Speech Recognition for Polish in 2022

Evaluating selected ASRs on a corpus of customer support dialogs

Automatic Speech Recognition (ASR), also known as Speech-To-Text (STT) transcription and, more specifically, as Large Vocabulary Continuous Speech Recognition (LVCSR), is a basic building block of many Natural Language Processing (NLP) solutions, such as voice-operated user interfaces, speech analytics applications and dialog systems.

The last few years have seen a significant increase in the demand for the latter two types of systems, both in Poland and worldwide. Large and medium-size companies, including banks, insurance firms and public institutions, have implemented speech analytics solutions to digitize, archive and explore recordings of spoken interactions and to gain analytical insights into customer support and sales processes. The intrinsic quality of ASR systems is a key prerequisite for the efficiency of such applications. Even a seemingly small difference in ASR quality may be critical in certain contexts. For example, the take-up rate of a voice bot may depend directly on the word error rate of its underlying ASR engine. It may be difficult to successfully deploy a voice bot whose overall word error rate is acceptable but which nevertheless consistently fails to recognize phone numbers or dates. More sophisticated Natural Language Understanding (NLU) modules and the general usefulness of speech analytics results also hinge upon ASR quality.
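To make the word error rate mentioned above concrete, here is a minimal sketch of how it is commonly computed: the word-level edit distance (substitutions, deletions and insertions) between a reference transcription and the ASR output, divided by the number of reference words. This is an illustration only and not the evaluation code used in the report; the function name word_error_rate and the simple lowercasing and whitespace tokenization are assumptions made for the example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Hypothetical helper for illustration; text normalization here is
    limited to lowercasing and whitespace splitting (an assumption).
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in the output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One misrecognized digit in a seven-word utterance.
reference = "my number is five five one two"
hypothesis = "my number is five nine one two"
print(f"WER: {word_error_rate(reference, hypothesis):.2f}")
```

In this example a single substituted digit among seven words gives a word error rate of roughly 0.14, which may look acceptable in aggregate even though the recognized phone number is unusable. This is why an overall score can hide exactly the kind of failure described above.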

This report evaluates selected commercially offered ASR engines for Polish.

The text of the report is available here. The recordings, transcriptions and evaluation code can be purchased as part of the DiaBiz corpus.

For further information about the report, please contact us at piotr.pezik@uni.lodz.pl.

Piotr Pęzik, Michał Adamczyk