SpokesBiz is a corpus of conversational Polish developed within the CLARIN-BIZ project and currently comprising over 650 hours of recordings. The transcribed recordings have been diarized and manually annotated for punctuation and casing.
A sample of the corpus can be accessed here.
Instructions for downloading the entire corpus (over 100GB of transcriptions with recordings) can be found below.
To access an online index of the corpus see: https://spokes.clarin-pl.eu.
A general overview of the corpus can be found here:
SpokesBiz is made up of several distinct subcorpora.
Subcorpus | Recordings | Words | Utterances | Hours | Speakers |
---|---|---|---|---|---|
CBIZ_BIO | 170 | 1383646 | 68429 | 166 | 170 |
CBIZ_INT | 10 | 26006 | 1575 | 2 | 11 |
CBIZ_LUZ | 297 | 1510024 | 103571 | 157 | 116 |
CBIZ_POD | 178 | 991464 | 47221 | 92 | 12 |
CBIZ_PRES | 56 | 256922 | 18120 | 38 | 39 |
CBIZ_VC | 84 | 655539 | 63188 | 71 | 110 |
CBIZ_VC2 | 84 | 760671 | 28397 | 89 | 86 |
CBIZ_WYW | 46 | 327148 | 16778 | 37 | 46 |
Total | 925 | 5911420 | 347279 | 652 | 590 |
The data was automatically automatically transcribed and time-aligned and subsequently manually corrected and annotated.
The table below summarises the metadata fields used to describe each utterance in the corpus.
Column | Description |
---|---|
conversation_id | unique identifier for each conversation |
recording_id | unique identifier for each recording |
recording_path | path to download recording file corresponding with the recording_id |
subcorpus | subcorpus name |
conversation_style | type of communication |
recording_year | year of recording |
recording_place | city where the recording was created (empty when unknown) |
recording_time_ss | recording time in seconds |
segment_seq | segment order within the conversation |
segment_id | unique segment identifier |
segment_text | segment text after manual correction |
segment_word_count | number of words within the segment (using SpaceTokenizer from NLTK) |
segment_ts_start_ms | segment beginning timestamp in milliseconds |
segment_ts_end_ms | segment ending timestamp in milliseconds |
segment_words_ts_ms | timestamps for every word in segment |
speaker_id | unique identifier for speaker |
speaker_sex | speaker sex with levels f and m |
speaker_education | speaker education with levels none, primary, secondary, vocational, higher |
exact_speaker_age | exact speaker age (NULL when unknown) |
speaker_age_range_from | minimum range value with levels 0, 20, 30, 40, 50, 60, 70, 80, 90 |
speaker_age_range_to | maximum range value with levels 19, 29, 39, 49, 59, 69, 79, 89, 99 |
speaker_region | Polish voivodeship (empty otherwise) |
speaker_first_language | speaker first language or languages |
Please fill out this form to get access to SpokesBiz: https://forms.office.com/e/cpn88mcFC6.
For more information contact piotr.pezik@uni.lodz.pl.
The current license of SpokesBiz is CC-BY-NC-ND. This means that:
SpokesBiz was developed in the project titled “CLARIN - Common Language Resources and Technology Infrastructure”, which is financed under the 2014-2020 Smart Growth Operational Programme, POIR.04.02.00-00C002/19. We would also like to acknowledge the support VoiceLab in the data transcription and processing efforts.