SpokesBiz is a freely available corpus of conversational Polish developed within the CLARIN-BIZ project. It currently comprises over 650 hours of recordings. The transcribed recordings have been diarized and manually annotated for punctuation and casing.
A general overview of the corpus can be found in this paper:
SpokesBiz is made up of several distinct subcorpora.
Subcorpus | Recordings | Words | Utterances | Hours | Speakers |
---|---|---|---|---|---|
CBIZ_BIO | 170 | 1383646 | 68429 | 166 | 170 |
CBIZ_INT | 10 | 26006 | 1575 | 2 | 11 |
CBIZ_LUZ | 297 | 1510024 | 103571 | 157 | 116 |
CBIZ_POD | 178 | 991464 | 47221 | 92 | 12 |
CBIZ_PRES | 56 | 256922 | 18120 | 38 | 39 |
CBIZ_VC | 84 | 655539 | 63188 | 71 | 110 |
CBIZ_VC2 | 84 | 760671 | 28397 | 89 | 86 |
CBIZ_WYW | 46 | 327148 | 16778 | 37 | 46 |
Total | 925 | 5911420 | 347279 | 652 | 590 |
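As a quick sanity check, the per-subcorpus figures can be summed and compared with the Total row. The sketch below is only an illustration: the figures are copied by hand from the table above, and the names `SUBCORPORA`, `column_totals` and `words_per_utterance` are hypothetical helpers, not part of the corpus distribution.

```python
# Figures copied from the subcorpus table above:
# name -> (recordings, words, utterances, hours, speakers)
SUBCORPORA = {
    "CBIZ_BIO": (170, 1383646, 68429, 166, 170),
    "CBIZ_INT": (10, 26006, 1575, 2, 11),
    "CBIZ_LUZ": (297, 1510024, 103571, 157, 116),
    "CBIZ_POD": (178, 991464, 47221, 92, 12),
    "CBIZ_PRES": (56, 256922, 18120, 38, 39),
    "CBIZ_VC": (84, 655539, 63188, 71, 110),
    "CBIZ_VC2": (84, 760671, 28397, 89, 86),
    "CBIZ_WYW": (46, 327148, 16778, 37, 46),
}


def column_totals(rows):
    """Sum each column across all subcorpora."""
    return tuple(sum(column) for column in zip(*rows.values()))


def words_per_utterance(rows):
    """Average utterance length in words for each subcorpus."""
    return {name: row[1] / row[2] for name, row in rows.items()}


if __name__ == "__main__":
    print("Totals:", column_totals(SUBCORPORA))
    for name, ratio in words_per_utterance(SUBCORPORA).items():
        print(f"{name}: {ratio:.1f} words per utterance")
```

The hour figures in the table are rounded, so totals derived from the actual recording durations may differ slightly from a sum of the rounded values.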
The data was automatically transcribed and time-aligned, then manually corrected and annotated.
The table below summarises the metadata fields used to describe each utterance in the corpus.
Column | Description |
---|---|
conversation_id | unique identifier for each conversation |
recording_id | unique identifier for each recording |
recording_path | download path of the recording file corresponding to the recording_id |
subcorpus | subcorpus name |
conversation_style | type of communication |
recording_year | year of recording |
recording_place | city where the recording was created (empty when unknown) |
recording_time_ss | recording time in seconds |
segment_seq | segment order within the conversation |
segment_id | unique segment identifier |
segment_text | segment text after manual correction |
segment_word_count | number of words within the segment (using SpaceTokenizer from NLTK) |
segment_ts_start_ms | segment beginning timestamp in milliseconds |
segment_ts_end_ms | segment ending timestamp in milliseconds |
segment_words_ts_ms | timestamps for every word in segment |
speaker_id | unique identifier for speaker |
speaker_sex | speaker sex with levels f and m |
speaker_education | speaker education with levels none, primary, secondary, vocational, higher |
exact_speaker_age | exact speaker age (NULL when unknown) |
speaker_age_range_from | minimum range value with levels 0, 20, 30, 40, 50, 60, 70, 80, 90 |
speaker_age_range_to | maximum range value with levels 19, 29, 39, 49, 59, 69, 79, 89, 99 |
speaker_region | Polish voivodeship (empty otherwise) |
speaker_first_language | speaker first language or languages |
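The fields above can be combined to derive simple per-segment statistics. The following is a minimal sketch, not an official loader: it assumes the timestamp fields are available as integers, the example record is invented for illustration, and the helper names (`segment_duration_s`, `space_word_count`, `age_range`) are hypothetical. Splitting on a single space is meant to mirror NLTK's SpaceTokenizer without requiring NLTK to be installed.

```python
def segment_duration_s(segment):
    """Segment length in seconds, from the millisecond timestamps."""
    return (segment["segment_ts_end_ms"] - segment["segment_ts_start_ms"]) / 1000.0


def space_word_count(text):
    """Count tokens by splitting on single spaces.

    This is intended to approximate segment_word_count; punctuation
    stays attached to the neighbouring word.
    """
    return len(text.split(" "))


def age_range(exact_age):
    """Map an exact age to the (from, to) bins listed in the table.

    The first bin is 0-19; all later bins span ten years.
    """
    if not 0 <= exact_age <= 99:
        raise ValueError("age outside the documented ranges")
    if exact_age < 20:
        return (0, 19)
    low = (exact_age // 10) * 10
    return (low, low + 9)


# Invented example record; values are illustrative only.
example = {
    "segment_text": "Dzień dobry, jak się masz?",
    "segment_ts_start_ms": 1200,
    "segment_ts_end_ms": 3450,
}

if __name__ == "__main__":
    print(segment_duration_s(example), "seconds")
    print(space_word_count(example["segment_text"]), "words")
    print(age_range(34))
```

If the metadata is read from a text format such as CSV, the timestamp and age fields may arrive as strings and need converting to integers first.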
Please fill out this form to get access to SpokesBiz: https://forms.office.com/e/cpn88mcFC6.
For more information contact piotr.pezik@uni.lodz.pl.
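The current license of SpokesBiz is CC-BY-NC-ND. This means that the corpus may be copied and redistributed with attribution, but it may not be used for commercial purposes or distributed in modified form.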
SpokesBiz was developed in the project titled “CLARIN - Common Language Resources and Technology Infrastructure”, which is financed under the 2014-2020 Smart Growth Operational Programme, POIR.04.02.00-00C002/19. We would also like to acknowledge the support of VoiceLab in the data transcription and processing efforts.