spokesbiz
Differences
This shows you the differences between two versions of the page.
spokesbiz [2023/07/27 12:47] – [Acknowledgments] pezik | spokesbiz [2023/12/20 08:39] (current) – pezik | ||
---|---|---|---|
Line 1: | Line 1: | ||
====== SpokesBiz ====== | ====== SpokesBiz ====== | ||
- | **SpokesBiz** is a freely available corpus of conversational Polish developed within the CLARIN-BIZ project and currently comprising over 400 (out of the target 600) hours of recordings. The transcribed recordings have been diarized and manually annotated for punctuation and casing. We expect to reach the target size of 600 hours by the end of 2023. | + | **SpokesBiz** is a freely available corpus of conversational Polish developed within the CLARIN-BIZ project and currently comprising over 650 hours of recordings. The transcribed recordings have been diarized and manually annotated for punctuation and casing. |
A general overview of the corpus can be found in this paper: | A general overview of the corpus can be found in this paper: | ||
- | * Piotr Pęzik, Sylwia Karasińska, | + | * Piotr Pęzik, Sylwia Karasińska, |
=== Subsets === | === Subsets === | ||
Line 11: | Line 11: | ||
^Subcorpus ^ Recordings ^ Words ^ Utterances ^ Hours ^ Speakers ^ | ^Subcorpus ^ Recordings ^ Words ^ Utterances ^ Hours ^ Speakers ^ | ||
- | |CBIZ_BIO | 147 | 1206401 | + | |CBIZ_BIO | 170 | 1383646 | 68429 | 166 | 170 | |
- | |CBIZ_INT | 9 | 23424 | 1315 | 2 | 10 | | + | |CBIZ_INT | 10 | 26006 | 1575 | 2 | 11 | |
- | |CBIZ_LUZ | 45 | 213479 | + | |CBIZ_LUZ | 297 | 1510024 |
- | |CBIZ_POD | 81 | 427074 | + | |CBIZ_POD | 178 | 991464 |
- | |CBIZ_PRES | 53 | 242282 | + | |CBIZ_PRES | 56 | 256922 |
- | |CBIZ_VC | 72 | 542411 | + | |CBIZ_VC | 84 | 655539 |
- | |CBIZ_VC2 | 83 | 749740 | 27947 | 87 | 85 | | + | |CBIZ_VC2 | 84 | 760671 |
- | |CBIZ_WYW | 26 | 170695 | + | |CBIZ_WYW | 46 | 327148 |
- | |Total | 516 | 3575506 | + | |Total | 925 | 5911420 |
Line 35: | Line 35: | ||
|recording_place | city where the recording was created (empty when unknown) | | |recording_place | city where the recording was created (empty when unknown) | | ||
|recording_time_ss | recording time in seconds | | |recording_time_ss | recording time in seconds | | ||
- | |segment_id | unique segment identifier | | ||
|segment_seq | segment order within the conversation | | |segment_seq | segment order within the conversation | | ||
- | |manually_added | + | |segment_id |
- | |segment_sub_seq | segment order for manually added segments (0 if segment wasn't manually added) | | + | |
- | |parent_segment_id | parent for manually added segment (empty if segment wasn't manually added) | + | |
|segment_text | segment text after manual correction | | |segment_text | segment text after manual correction | | ||
|segment_word_count | number of words within the segment (using SpaceTokenizer from NLTK) | | |segment_word_count | number of words within the segment (using SpaceTokenizer from NLTK) | | ||
- | |segment_ts_start_ms | segment beginning timestamp | + | |segment_ts_start_ms | segment beginning timestamp in milliseconds| |
- | |segment_ts_end_ms | segment ending timestamp in milliseconds | + | |segment_ts_end_ms | segment ending timestamp in milliseconds |
+ | |segment_words_ts_ms | timestamps for every word in segment | ||
|speaker_id | unique identifier for speaker | | |speaker_id | unique identifier for speaker | | ||
|speaker_sex | speaker sex with levels f and m | | |speaker_sex | speaker sex with levels f and m | | ||
|speaker_education | speaker education with levels none, primary, secondary, vocational, higher | | |speaker_education | speaker education with levels none, primary, secondary, vocational, higher | | ||
- | |speaker_age_range | + | |exact_speaker_age |
+ | |speaker_age_range_from | minimum range value with levels | ||
+ | |speaker_age_range_to | maximum range value with levels 19, 29, 39, 49, 59, 69, 79, 89, 99 | | ||
|speaker_region | Polish voivodeship (empty otherwise) | | |speaker_region | Polish voivodeship (empty otherwise) | | ||
|speaker_first_language | speaker first language or languages | | |speaker_first_language | speaker first language or languages | | ||
Line 63: | Line 63: | ||
- Users must cite the above-mentioned publication announcing SpokesBiz. | - Users must cite the above-mentioned publication announcing SpokesBiz. | ||
- The corpus must not be used for commercial purposes. | - The corpus must not be used for commercial purposes. | ||
- | - "If you remix, transform, or build upon the material, you may not distribute the modified material." | + | - "If you remix, transform, or build upon the material, you may not distribute the modified material." |
Line 72: | Line 72: | ||
* Paweł Wilk | * Paweł Wilk | ||
* Sylwia Karasińska | * Sylwia Karasińska | ||
+ | * Angelika Peljak-Łapińska | ||
* Karolina Adamczyk | * Karolina Adamczyk | ||
* Monika Garnys | * Monika Garnys |
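
The segment-level fields described in the metadata table (segment_text, segment_ts_start_ms, segment_ts_end_ms, segment_word_count) can be combined to derive per-segment durations and speech rates. The following is a minimal sketch, not part of the corpus tooling: the `Segment` class and the helper function names are hypothetical, and it assumes the metadata has already been loaded into Python objects. The word count uses `str.split(" ")`, which mirrors how NLTK's `SpaceTokenizer` splits on the single space character, so no NLTK install is needed for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical container for three of the documented segment fields."""
    segment_text: str
    segment_ts_start_ms: int
    segment_ts_end_ms: int


def word_count(text: str) -> int:
    # Splitting on a single space mirrors NLTK's SpaceTokenizer,
    # which the field description names for segment_word_count.
    return len(text.split(" "))


def duration_seconds(seg: Segment) -> float:
    # Both timestamps are documented as milliseconds.
    return (seg.segment_ts_end_ms - seg.segment_ts_start_ms) / 1000.0


def words_per_minute(seg: Segment) -> float:
    # Guard against zero-length or malformed segments.
    d = duration_seconds(seg)
    return word_count(seg.segment_text) * 60.0 / d if d > 0 else 0.0


if __name__ == "__main__":
    # Made-up example segment, not taken from the corpus.
    seg = Segment("dzień dobry państwu", 1000, 2500)
    print(word_count(seg.segment_text), duration_seconds(seg), words_per_minute(seg))
```

Because splitting on a single space treats consecutive spaces as delimiting empty tokens, counts computed this way can differ from whitespace-normalized counts; that is a property of the tokenizer named in the field description, not of this sketch.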
spokesbiz.1690454827.txt.gz · Last modified: 2023/07/27 12:47 by pezik