spokesbiz

Differences

This shows you the differences between two versions of the page.

spokesbiz [2023/07/27 12:47] – [Acknowledgments] pezik
spokesbiz [2023/12/20 08:39] (current) pezik
Line 1: Line 1:
 ====== SpokesBiz ======
-**SpokesBiz** is a freely available corpus of conversational Polish developed within the CLARIN-BIZ project and currently comprising over 400 (out of the target 600) hours of recordings. The transcribed recordings have been diarized and manually annotated for punctuation and casing. We expect to reach the target size of 600 hours by the end of 2023.
+**SpokesBiz** is a freely available corpus of conversational Polish developed within the CLARIN-BIZ project and currently comprising over 650 hours of recordings. The transcribed recordings have been diarized and manually annotated for punctuation and casing.
  
 A general overview of the corpus can be found in this paper:
  
-  * Piotr Pęzik, Sylwia Karasińska, Anna Cichosz, Łukasz Jałowiecki, Konrad Kaczyński, Małgorzata Krawentek, Karolina Walkusz, Paweł Wilk, Mariusz Kleć, Krzysztof Szklanny, Szymon Marszałkowski 2023 (forthcoming) SpokesBiz – an Open Corpus of Conversational Polish. [[https://drive.google.com/file/d/1ES8hifCS3YbUFIkcCaU7i69hlguWBd4Y/view?usp=sharing|A draft]]
+  * Piotr Pęzik, Sylwia Karasińska, Anna Cichosz, Łukasz Jałowiecki, Konrad Kaczyński, Małgorzata Krawentek, Karolina Walkusz, Paweł Wilk, Mariusz Kleć, Krzysztof Szklanny, Szymon Marszałkowski 2023 (forthcoming) [[http://arxiv.org/abs/2312.12364|SpokesBiz – an Open Corpus of Conversational Polish]]
  
 === Subsets ===
Line 11: Line 11:
    
 ^Subcorpus ^ Recordings ^ Words ^ Utterances ^ Hours ^ Speakers ^
-|CBIZ_BIO | 147 | 1206401 | 56641 | 145 | 150 |
+|CBIZ_BIO | 170 | 1383646 | 68429 | 166 | 170 |
-|CBIZ_INT | 9 | 23424 | 1315 | 2 | 10 |
+|CBIZ_INT | 10 | 26006 | 1575 | 2 | 11 |
-|CBIZ_LUZ | 45 | 213479 | 14980 | 21 | 39 |
+|CBIZ_LUZ | 297 | 1510024 | 103571 | 157 | 116 |
-|CBIZ_POD | 81 | 427074 | 20397 | 41 | 12 |
+|CBIZ_POD | 178 | 991464 | 47221 | 92 | 12 |
-|CBIZ_PRES | 53 | 242282 | 17107 | 36 | 38 |
+|CBIZ_PRES | 56 | 256922 | 18120 | 38 | 39 |
-|CBIZ_VC | 72 | 542411 | 57017 | 58 | 101 |
+|CBIZ_VC | 84 | 655539 | 63188 | 71 | 110 |
-|CBIZ_VC2 | 83 | 749740 | 27947 | 87 | 85 |
+|CBIZ_VC2 | 84 | 760671 | 28397 | 89 | 86 |
-|CBIZ_WYW | 26 | 170695 | 8363 | 19 | 31 |
+|CBIZ_WYW | 46 | 327148 | 16778 | 37 | 46 |
-|Total | 516 | 3575506 | 203767 | 409 | 441 |
+|Total | 925 | 5911420 | 347279 | 652 | 590 |
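
As a quick consistency check, the per-subcorpus figures of the newer (2023/12/20) revision can be summed and compared with its Total row. The sketch below is only illustrative: the names `ROWS`, `TOTAL`, and `column_sums` are hypothetical, and the values are transcribed by hand from the table rather than read from the corpus itself.

```python
# Per-subcorpus figures from the 2023/12/20 revision, transcribed by hand:
# (recordings, words, utterances, hours, speakers)
ROWS = {
    "CBIZ_BIO": (170, 1383646, 68429, 166, 170),
    "CBIZ_INT": (10, 26006, 1575, 2, 11),
    "CBIZ_LUZ": (297, 1510024, 103571, 157, 116),
    "CBIZ_POD": (178, 991464, 47221, 92, 12),
    "CBIZ_PRES": (56, 256922, 18120, 38, 39),
    "CBIZ_VC": (84, 655539, 63188, 71, 110),
    "CBIZ_VC2": (84, 760671, 28397, 89, 86),
    "CBIZ_WYW": (46, 327148, 16778, 37, 46),
}

# Total row as stated in the same revision.
TOTAL = (925, 5911420, 347279, 652, 590)


def column_sums(rows):
    """Sum each column across all subcorpus rows."""
    return tuple(sum(column) for column in zip(*rows.values()))


if __name__ == "__main__":
    sums = column_sums(ROWS)
    labels = ("recordings", "words", "utterances", "hours", "speakers")
    for label, got, expected in zip(labels, sums, TOTAL):
        status = "ok" if got == expected else "MISMATCH"
        print(f"{label}: {got} vs {expected} ({status})")
```

For this revision every column sum agrees with the stated Total row.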
  
  
Line 35: Line 35:
 |recording_place | city where the recording was created (empty when unknown) |
 |recording_time_ss | recording time in seconds |
-|segment_id | unique segment identifier |
 |segment_seq | segment order within the conversation |
-|manually_added | True if segment was added in manual correction process |
-|segment_sub_seq | segment order for manually added segments (0 if segment wasn't manually added) |
-|parent_segment_id | parent for manually added segment (empty if segment wasn't manually added) |
+|segment_id | unique segment identifier |
 |segment_text | segment text after manual correction |
 |segment_word_count | number of words within the segment (using SpaceTokenizer from NLTK) |
-|segment_ts_start_ms | segment beginning timestamp in milliseconds (generated automatically before manual correction) |
-|segment_ts_end_ms | segment ending timestamp in milliseconds (generated automatically before manual correction) |
+|segment_ts_start_ms | segment beginning timestamp in milliseconds |
+|segment_ts_end_ms | segment ending timestamp in milliseconds |
+|segment_words_ts_ms | timestamps for every word in segment |
 |speaker_id | unique identifier for speaker |
 |speaker_sex | speaker sex with levels f and m |
 |speaker_education | speaker education with levels none, primary, secondary, vocational, higher |
-|speaker_age_range | speaker age with levels < 20, 20-29, 30-39, 40-49, 50-59, 60-69, 70-79, 80-89, >= 90 |
+|exact_speaker_age | exact speaker age (NULL when unknown) |
+|speaker_age_range_from | minimum range value with levels 0, 20, 30, 40, 50, 60, 70, 80, 90 |
+|speaker_age_range_to | maximum range value with levels 19, 29, 39, 49, 59, 69, 79, 89, 99 |
 |speaker_region | Polish voivodeship (empty otherwise) |
 |speaker_first_language | speaker first language or languages |
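
The field changes above can be illustrated with a small Python sketch. It is not the project's own tooling: the helper names are hypothetical, and the mapping from the old `speaker_age_range` labels to the new `speaker_age_range_from` / `speaker_age_range_to` values is inferred from the level lists in the two revisions. The word count assumes that NLTK's SpaceTokenizer splits on single space characters, which is approximated here with `str.split(" ")`.

```python
def age_range_bounds(label):
    """Map an old speaker_age_range label to (from, to) bounds.

    Assumed mapping, inferred from the listed levels:
    "< 20" -> (0, 19), "20-29" -> (20, 29), ">= 90" -> (90, 99).
    """
    compact = label.replace(" ", "")
    if compact.startswith(">="):
        return int(compact[2:]), 99
    if compact.startswith("<"):
        return 0, int(compact[1:]) - 1
    low, high = compact.split("-")
    return int(low), int(high)


def segment_duration_ms(segment_ts_start_ms, segment_ts_end_ms):
    """Segment length in milliseconds from the two timestamp fields."""
    return segment_ts_end_ms - segment_ts_start_ms


def space_word_count(segment_text):
    """Count tokens separated by single spaces (assumed SpaceTokenizer-like)."""
    return len(segment_text.split(" "))


if __name__ == "__main__":
    print(age_range_bounds("< 20"))
    print(age_range_bounds("40-49"))
    print(segment_duration_ms(1500, 4200))
    print(space_word_count("to jest test"))
```

Splitting on a single space means that consecutive spaces yield empty tokens, so counts obtained this way may differ from a whitespace-insensitive count.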
Line 63: Line 63:
   - Users must cite the above-mentioned publication announcing SpokesBiz.
   - The corpus must not be used for commercial purposes.
-  - "If you remix, transform, or build upon the material, you may not distribute the modified material." In other words, you can build and distribute tools or models based on the material, but you must not redistribute the corpus itself.
+  - "If you remix, transform, or build upon the material, you may not distribute the modified material." In other words, you can build and distribute tools or models based on the material, but you must not redistribute the corpus data itself or any parts of it. If you need a different license for the corpus, please contact us at pelcra@uni.lodz.pl.
    
  
Line 72: Line 72:
   * Paweł Wilk
   * Sylwia Karasińska
+  * Angelika Peljak-Łapińska
   * Karolina Adamczyk
   * Monika Garnys
spokesbiz.1690454827.txt.gz · Last modified: 2023/07/27 12:47 by pezik