spokesbiz

Differences

This shows you the differences between two versions of the page.

spokesbiz [2023/08/11 20:31] pezik
spokesbiz [2023/12/20 08:39] (current) pezik
Line 1: Line 1:
 ====== SpokesBiz ======
-**SpokesBiz** is a freely available corpus of conversational Polish developed within the CLARIN-BIZ project and currently comprising over 400 (out of the target 600) hours of recordings. The transcribed recordings have been diarized and manually annotated for punctuation and casing. We expect to reach the target size of 600 hours by the end of 2023.
+**SpokesBiz** is a freely available corpus of conversational Polish developed within the CLARIN-BIZ project and currently comprising over 650 hours of recordings. The transcribed recordings have been diarized and manually annotated for punctuation and casing.
  
 A general overview of the corpus can be found in this paper:
  
-  * Piotr Pęzik, Sylwia Karasińska, Anna Cichosz, Łukasz Jałowiecki, Konrad Kaczyński, Małgorzata Krawentek, Karolina Walkusz, Paweł Wilk, Mariusz Kleć, Krzysztof Szklanny, Szymon Marszałkowski 2023 (forthcoming) [[https://drive.google.com/file/d/1ES8hifCS3YbUFIkcCaU7i69hlguWBd4Y/view?usp=sharing|SpokesBiz – an Open Corpus of Conversational Polish]]
+  * Piotr Pęzik, Sylwia Karasińska, Anna Cichosz, Łukasz Jałowiecki, Konrad Kaczyński, Małgorzata Krawentek, Karolina Walkusz, Paweł Wilk, Mariusz Kleć, Krzysztof Szklanny, Szymon Marszałkowski 2023 (forthcoming) [[http://arxiv.org/abs/2312.12364|SpokesBiz – an Open Corpus of Conversational Polish]]
  
 === Subsets ===
Line 11: Line 11:
    
 ^Subcorpus ^ Recordings ^ Words ^ Utterances ^ Hours ^ Speakers ^
-|CBIZ_BIO | 147 | 1206401 | 56641 | 145 | 150 |
+|CBIZ_BIO | 170 | 1383646 | 68429 | 166 | 170 |
-|CBIZ_INT | 9 | 23424 | 1315 | 2 | 10 |
+|CBIZ_INT | 10 | 26006 | 1575 | 2 | 11 |
-|CBIZ_LUZ | 45 | 213479 | 14980 | 21 | 39 |
+|CBIZ_LUZ | 297 | 1510024 | 103571 | 157 | 116 |
-|CBIZ_POD | 81 | 427074 | 20397 | 41 | 12 |
+|CBIZ_POD | 178 | 991464 | 47221 | 92 | 12 |
-|CBIZ_PRES | 53 | 242282 | 17107 | 36 | 38 |
+|CBIZ_PRES | 56 | 256922 | 18120 | 38 | 39 |
-|CBIZ_VC | 72 | 542411 | 57017 | 58 | 101 |
+|CBIZ_VC | 84 | 655539 | 63188 | 71 | 110 |
-|CBIZ_VC2 | 83 | 749740 | 27947 | 87 | 85 |
+|CBIZ_VC2 | 84 | 760671 | 28397 | 89 | 86 |
-|CBIZ_WYW | 26 | 170695 | 8363 | 19 | 31 |
+|CBIZ_WYW | 46 | 327148 | 16778 | 37 | 46 |
-|Total | 516 | 3575506 | 203767 | 409 | 441 |
+|Total | 925 | 5911420 | 347279 | 652 | 590 |
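
As a quick consistency check on the revised table, the per-subcorpus figures from the 2023/12/20 revision can be summed and compared with the Total row. The sketch below is illustrative only: the `ROWS` and `TOTAL` names and the `column_sums` helper are hypothetical and are not part of the SpokesBiz distribution.

```python
# Figures copied from the 2023/12/20 revision of the table.
# Column order: recordings, words, utterances, hours, speakers.
ROWS = {
    "CBIZ_BIO": (170, 1383646, 68429, 166, 170),
    "CBIZ_INT": (10, 26006, 1575, 2, 11),
    "CBIZ_LUZ": (297, 1510024, 103571, 157, 116),
    "CBIZ_POD": (178, 991464, 47221, 92, 12),
    "CBIZ_PRES": (56, 256922, 18120, 38, 39),
    "CBIZ_VC": (84, 655539, 63188, 71, 110),
    "CBIZ_VC2": (84, 760671, 28397, 89, 86),
    "CBIZ_WYW": (46, 327148, 16778, 37, 46),
}
TOTAL = (925, 5911420, 347279, 652, 590)


def column_sums(rows):
    """Sum each column across all subcorpus rows (hypothetical helper)."""
    return tuple(sum(column) for column in zip(*rows.values()))


sums = column_sums(ROWS)
print("computed:", sums)
print("matches Total row:", sums == TOTAL)
```

The same check can be rerun whenever the table is updated, by editing `ROWS` and `TOTAL` to match the new revision.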
  
  
Line 63: Line 63:
   - Users must cite the above-mentioned publication announcing SpokesBiz.
   - The corpus must not be used for commercial purposes.
-  - "If you remix, transform, or build upon the material, you may not distribute the modified material." In other words, you can build and distribute tools or models based on the material, but you must not redistribute the corpus itself.
+  - "If you remix, transform, or build upon the material, you may not distribute the modified material." In other words, you can build and distribute tools or models based on the material, but you must not redistribute the corpus data itself or any parts of it. If you need a different license for the corpus, please contact us at pelcra@uni.lodz.pl.
    
  
Line 72: Line 72:
   * Paweł Wilk
   * Sylwia Karasińska
+  * Angelika Peljak-Łapińska
   * Karolina Adamczyk
   * Monika Garnys
Line 82: Line 83:
   * Maciej Grabski
   * Karol Ługowski
-  * Angelika Peljak-Łapińska 
   * Michał Koźmiński
   * Zuzanna Deckert
spokesbiz.1691778710.txt.gz · Last modified: 2023/08/11 20:31 by pezik