====== SpokesBiz ======
**SpokesBiz** is a corpus of conversational Polish developed within the CLARIN-BIZ project and currently comprising over 650 hours of recordings. The transcribed recordings have been diarized and manually annotated for punctuation and casing. 


A sample of the corpus can be [[https://drive.google.com/file/d/1n7pgueKAylKwExzecXhdtauTI6rMzFug/view?usp=sharing | accessed here]].

Instructions for downloading the entire corpus (over 100GB of transcriptions with recordings) can be found below. 

To access an online index of the corpus see: [[https://spokes.clarin-pl.eu]].

A general overview of the corpus can be found here: 

  * Piotr Pęzik, Sylwia Karasińska, Anna Cichosz, Łukasz Jałowiecki, Konrad Kaczyński, Małgorzata Krawentek, Karolina Walkusz, Paweł Wilk, Mariusz Kleć, Krzysztof Szklanny, Szymon Marszałkowski 2023 (forthcoming) [[http://arxiv.org/abs/2312.12364|SpokesBiz – an Open Corpus of Conversational Polish]]

=== Subsets ===

SpokesBiz is made up of several distinct subcorpora.
 
^Subcorpus ^ Recordings ^ Words ^ Utterances ^ Hours ^ Speakers ^ 
|CBIZ_BIO | 170 | 1383646| 68429| 166| 170 |  
|CBIZ_INT | 10 | 26006 | 1575 | 2 | 11  |
|CBIZ_LUZ | 297 | 1510024 | 103571 | 157 | 116  |
|CBIZ_POD | 178 | 991464 | 47221 | 92 | 12  |
|CBIZ_PRES | 56 | 256922 | 18120 | 38 | 39 |  
|CBIZ_VC | 84 | 655539 | 63188 | 71 | 110 | 
|CBIZ_VC2 | 84 | 760671 | 28397 | 89 | 86 | 
|CBIZ_WYW | 46 | 327148 | 16778 | 37 | 46 |  
|Total | 925 | 5911420 | 347279 | 652 | 590 |


The data was automatically automatically **transcribed** and **time-aligned** and subsequently manually **corrected** and **annotated**. 

The table below summarises the metadata fields used to describe each utterance in the corpus.

^Column ^ Description ^ 
|conversation_id | unique identifier for each conversation |
|recording_id | unique identifier for each recording |
|recording_path | path to download recording file corresponding with the recording_id |
|subcorpus | subcorpus name |
|conversation_style | type of communication |
|recording_year | year of recording |
|recording_place | city where the recording was created (empty when unknown) |
|recording_time_ss | recording time in seconds |
|segment_seq | segment order within the conversation |
|segment_id | unique segment identifier |
|segment_text | segment text after manual correction |
|segment_word_count | number of words within the segment (using SpaceTokenizer from NLTK) |
|segment_ts_start_ms | segment beginning timestamp in milliseconds|
|segment_ts_end_ms | segment ending timestamp in milliseconds |
|segment_words_ts_ms | timestamps for every word in segment |
|speaker_id | unique identifier for speaker |
|speaker_sex | speaker sex with levels f and m |
|speaker_education | speaker education with levels none, primary, secondary, vocational, higher |
|exact_speaker_age | exact speaker age (NULL when unknown) |
|speaker_age_range_from | minimum range value with levels 0, 20, 30, 40, 50, 60, 70, 80, 90 |
|speaker_age_range_to | maximum range value with levels 19, 29, 39, 49, 59, 69, 79, 89, 99 |
|speaker_region | Polish voivodeship (empty otherwise) |
|speaker_first_language | speaker first language or languages |

=====Availability=====


Please fill out this form to get access to SpokesBiz: [[https://forms.office.com/e/cpn88mcFC6|https://forms.office.com/e/cpn88mcFC6]].

For more information contact piotr.pezik@uni.lodz.pl.

=====License=====

The current license of SpokesBiz is [[https://creativecommons.org/licenses/by-nc-nd/3.0/|CC-BY-NC-ND]]. This means that: 
  - Users must cite the above-mentioned publication announcing SpokesBiz.
  - The corpus must not be used for commercial purposes.
  - "If you remix, transform, or build upon the material, you may not distribute the modified material." In other words, you can build and distribute tools or models based on the material, but you must not redistribute the corpus data itself or any parts of it. If you need a different license for the corpus, please contact us at pelcra@uni.lodz.pl
 

=====Project Team====
  * Piotr Pęzik
  * Michał Adamczyk
  * Małgorzata Krawentek
  * Paweł Wilk
  * Sylwia Karasińska
  * Angelika Peljak-Łapińska
  * Karolina Adamczyk
  * Monika Garnys
  * Karolina Walkusz
  * Anna Cichosz
  * Anna Kwiatkowska
  * Mikołaj Deckert
  * Paulina Rybińska
  * Izabela Grabarczyk
  * Maciej Grabski
  * Karol Ługowski
  * Michał Koźmiński
  * Zuzanna Deckert
  * Piotr Górniak
  * Konrad Kaczyński
  * Łukasz Jałowiecki

=====Acknowledgments====

[[https://clarin.biz/index.php/en/|CLARIN-BIZ]]

SpokesBiz was developed in the project titled “CLARIN - Common Language Resources and
Technology Infrastructure”, which is financed under the 2014-2020 Smart Growth Operational Programme,
POIR.04.02.00-00C002/19. We would also like to acknowledge the support  VoiceLab in the data transcription and processing efforts.

{{:screenshot_2025-09-09_at_15.32.42.png?400|}}