spokesbiz

Differences

This shows you the differences between two versions of the page.

spokesbiz [2023/08/21 23:20] ljalowiecki
spokesbiz [2023/12/20 08:39] (current) pezik
Line 1: Line 1:
 ====== SpokesBiz ======
-**SpokesBiz** is a freely available corpus of conversational Polish developed within the CLARIN-BIZ project and currently comprising over 500 (out of the target 600) hours of recordings. The transcribed recordings have been diarized and manually annotated for punctuation and casing. We expect to reach the target size of 600 hours by the end of 2023.
+**SpokesBiz** is a freely available corpus of conversational Polish developed within the CLARIN-BIZ project and currently comprising over 650 hours of recordings. The transcribed recordings have been diarized and manually annotated for punctuation and casing.
  
 A general overview of the corpus can be found in this paper:
  
-  * Piotr Pęzik, Sylwia Karasińska, Anna Cichosz, Łukasz Jałowiecki, Konrad Kaczyński, Małgorzata Krawentek, Karolina Walkusz, Paweł Wilk, Mariusz Kleć, Krzysztof Szklanny, Szymon Marszałkowski 2023 (forthcoming) [[https://drive.google.com/file/d/1ES8hifCS3YbUFIkcCaU7i69hlguWBd4Y/view?usp=sharing|SpokesBiz – an Open Corpus of Conversational Polish]]
+  * Piotr Pęzik, Sylwia Karasińska, Anna Cichosz, Łukasz Jałowiecki, Konrad Kaczyński, Małgorzata Krawentek, Karolina Walkusz, Paweł Wilk, Mariusz Kleć, Krzysztof Szklanny, Szymon Marszałkowski 2023 (forthcoming) [[http://arxiv.org/abs/2312.12364|SpokesBiz – an Open Corpus of Conversational Polish]]
  
 === Subsets ===
Line 11: Line 11:
    
 ^Subcorpus ^ Recordings ^ Words ^ Utterances ^ Hours ^ Speakers ^
-|CBIZ_BIO | 158 | 1292062 | 63420 | 155 | 161 |
+|CBIZ_BIO | 170 | 1383646 | 68429 | 166 | 170 |
-|CBIZ_INT | 9 | 23424 | 1315 | 2 | 10 |
+|CBIZ_INT | 10 | 26006 | 1575 | 2 | 11 |
-|CBIZ_LUZ | 158 | 814936 | 55601 | 83 | 82 |
+|CBIZ_LUZ | 297 | 1510024 | 103571 | 157 | 116 |
-|CBIZ_POD | 113 | 616864 | 28261 | 58 | 12 |
+|CBIZ_POD | 178 | 991464 | 47221 | 92 | 12 |
-|CBIZ_PRES | 55 | 252558 | 17570 | 37 | 38 |
+|CBIZ_PRES | 56 | 256922 | 18120 | 38 | 39 |
-|CBIZ_VC | 78 | 590958 | 59350 | 64 | 107 |
+|CBIZ_VC | 84 | 655539 | 63188 | 71 | 110 |
 |CBIZ_VC2 | 84 | 760671 | 28397 | 89 | 86 |
-|CBIZ_WYW | 38 | 263923 | 13964 | 31 | 40 |
+|CBIZ_WYW | 46 | 327148 | 16778 | 37 | 46 |
-|Total | 693 | 4615396 | 267878 | 519 | 536 |
+|Total | 925 | 5911420 | 347279 | 652 | 590 |
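The Total row of the newer (2023/12/20) revision can be cross-checked against the per-subcorpus rows. The sketch below is only an illustration: the variable and function names are made up for this example, and the split of the run-together CBIZ_BIO figures into separate cells is an assumption, chosen because it is the reading under which every column sums to the reported total.

```python
# Per-subcorpus figures from the newer (2023/12/20) revision of the table.
# Tuple order: (recordings, words, utterances, hours, speakers).
# The CBIZ_BIO cell boundaries are an assumed reading of the source.
rows = {
    "CBIZ_BIO": (170, 1383646, 68429, 166, 170),
    "CBIZ_INT": (10, 26006, 1575, 2, 11),
    "CBIZ_LUZ": (297, 1510024, 103571, 157, 116),
    "CBIZ_POD": (178, 991464, 47221, 92, 12),
    "CBIZ_PRES": (56, 256922, 18120, 38, 39),
    "CBIZ_VC": (84, 655539, 63188, 71, 110),
    "CBIZ_VC2": (84, 760671, 28397, 89, 86),
    "CBIZ_WYW": (46, 327148, 16778, 37, 46),
}

# Total row as reported on the page.
reported_total = (925, 5911420, 347279, 652, 590)


def column_sums(table):
    """Sum each column across all subcorpus rows."""
    return tuple(sum(column) for column in zip(*table.values()))


computed_total = column_sums(rows)
print(computed_total == reported_total)
```

The same check can be run on the older revision by swapping in its row values.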
  
  
Line 63: Line 63:
   - Users must cite the above-mentioned publication announcing SpokesBiz.
   - The corpus must not be used for commercial purposes.
-  - "If you remix, transform, or build upon the material, you may not distribute the modified material." In other words, you can build and distribute tools or models based on the material, but you must not redistribute the corpus itself.
+  - "If you remix, transform, or build upon the material, you may not distribute the modified material." In other words, you can build and distribute tools or models based on the material, but you must not redistribute the corpus data itself or any parts of it. If you need a different license for the corpus, please contact us at pelcra@uni.lodz.pl.
    
  
spokesbiz.1692652852.txt.gz · Last modified: 2023/08/21 23:20 by ljalowiecki