spoken_offline_corpora

Differences

This shows you the differences between two versions of the page.

spoken_offline_corpora [2022/07/27 12:04] pezik
spoken_offline_corpora [2023/10/08 11:22] (current) pezik
Line 1:
 ======PELCRA Spoken Offline Corpora======
  
-PELCRA Spoken Offline Corpora of conversational Polish which were  collected as part of the [[http://clarin-pl.eu|CLARIN-PL project]] are available via links in the table below
+PELCRA Spoken Offline Corpora of conversational Polish (a.k.a. SpokesMix) were collected as part of the [[http://clarin-pl.eu|CLARIN-PL project]].
- +
 Each corpus consists of speech recordings (in WAV format) and word-by-word transcriptions, which also include some non-speech events. The transcriptions (in EAF format) are complemented with word and phone annotations (_out.eaf files) and, where available, with video content (MP4 format) and PDF transcripts.
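
As a rough illustration of working with the EAF transcriptions, the sketch below reads time-aligned annotations with Python's standard library. It assumes only the generic ELAN layout (`TIME_SLOT`, `TIER`, `ALIGNABLE_ANNOTATION`, `ANNOTATION_VALUE`, times in milliseconds); the tier name and the sample text are hypothetical and are not taken from the corpora, whose actual tier naming is not described here.

```python
import xml.etree.ElementTree as ET

# Hand-written EAF fragment for illustration only; NOT taken from the corpora.
# The tier name "speaker_1" is a hypothetical placeholder.
SAMPLE_EAF = """
<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="850"/>
    <TIME_SLOT TIME_SLOT_ID="ts3" TIME_VALUE="1400"/>
  </TIME_ORDER>
  <TIER TIER_ID="speaker_1">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1" TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>dzień dobry</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a2" TIME_SLOT_REF1="ts2" TIME_SLOT_REF2="ts3">
        <ANNOTATION_VALUE>no tak</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>
"""


def read_eaf_annotations(root):
    """Return (tier_id, start_ms, end_ms, text) for every time-aligned annotation."""
    # Map time slot ids to millisecond offsets; TIME_VALUE is optional in EAF.
    slots = {}
    for slot in root.iter("TIME_SLOT"):
        value = slot.get("TIME_VALUE")
        slots[slot.get("TIME_SLOT_ID")] = int(value) if value is not None else None

    rows = []
    for tier in root.iter("TIER"):
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            start = slots.get(ann.get("TIME_SLOT_REF1"))
            end = slots.get(ann.get("TIME_SLOT_REF2"))
            text = ann.findtext("ANNOTATION_VALUE", default="")
            rows.append((tier.get("TIER_ID"), start, end, text))
    return rows


rows = read_eaf_annotations(ET.fromstring(SAMPLE_EAF))
for tier_id, start, end, text in rows:
    print(tier_id, start, end, text)
```

For a file on disk, `ET.parse(path).getroot()` can be passed to the same function.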
  
-Metadata are provided in XML files listing information about the recordings (titles, topics, dates, and URLs), media available (audio, video, pdf), and annotation details (file, date, annotator, place, duration, with additional information about the speakers whenever such data was available). A Document Type Definition specifying the structure of the elements and attributes of an XML document is included in each of the corpora. SQLite database with all the corpora metadata is also available for [[https://uniwersytetlodzki-my.sharepoint.com/:u:/g/personal/pelcra_uni_lodz_pl/Ec0r-4hNLalIrTnsHl4BQQgBJUdhAK7OMmKp28vh7_Ze2w?e=bwxGsp|download.]]
+Metadata are provided in XML files listing information about the recordings (titles, topics, dates, and URLs), the media available (audio, video, PDF), and annotation details (file, date, annotator, place, and duration, with additional information about the speakers whenever such data were available). A Document Type Definition (DTD) specifying the structure of the XML elements and attributes is included in each corpus.
- +
-Most of these offline corpora are indexed in [[http://spokes.clarin-pl.eu/|Spokes]] and [[http://pelcra.clarin-pl.eu/spokes2-web/|Spokes2]] +
- +
-All PELCRA spoken offline corpora have altogether about 1 220 400 words. +
  
-^  corpus  ^  description  ^  recordings  ^  speakers  ^   word  count    voice activity time  (hh:mm)  ^  total duration  (hh:mm)  ^  link  ^
+Most of these offline corpora are indexed in [[http://spokes.clarin-pl.eu/|Spokes]] and [[http://pelcra.clarin-pl.eu/spokes2-web/|Spokes2]]. A subset of them can be obtained by [[https://forms.office.com/e/TMAA36FRwf|filling out this form]]. Once the form is submitted, you will receive the password needed to download the corpora.
-| PELCRA_EMO |A corpus of focused interviews (people reflecting upon their emotions). |  40  |  80  |  252,000  |  26:53  |  28:12  |  [[https://uniwersytetlodzki-my.sharepoint.com/:f:/g/personal/pelcra_uni_lodz_pl/EtxBCk44jGZIs24XDRds4lgB9SyeP9w_VRRzVwNRTq1C9w?e=RpiHr0|Download]] +
-| PELCRA_LUZ |A corpus of open interviews. |  21  |  42  |  213,000  |  20:14  |  19:58  | [[https://uniwersytetlodzki-my.sharepoint.com/:f:/g/personal/pelcra_uni_lodz_pl/EnMgq0aOpPVGvYeR_SbjZIMBkz-gLj3qRR1uTGbhDlNJ6g?e=MEhHRG|Download]] | +
-| PELCRA_EMI |A corpus of Polish emmigrants to Scotland|  22  |  44  |  96,000  |  09:36  |  18:07  | [[https://uniwersytetlodzki-my.sharepoint.com/:f:/g/personal/pelcra_uni_lodz_pl/EgbGrvTeG65Kjw4eYNAZiuYBi0bvh_2x4VbvMp_kCobGzw?e=lpdTOz|Download]] +
-| PELCRA_PARL |Samples of spoken parliamentary data|  48  |  241  |  99,000  |  12:22  |  14:13  | [[https://uniwersytetlodzki-my.sharepoint.com/:f:/g/personal/pelcra_uni_lodz_pl/EpPehikqGqZJltrAKlVp3k0BOeyzEgBBO_ZwmFC9WaLbWw|Download]] | +
-| PELCRA_YT_1 |Samples of Polish YouTubers' videos. |  25  |  106  |  49,000  |  04:56  |  06:39  | [[https://uniwersytetlodzki-my.sharepoint.com/:f:/g/personal/pelcra_uni_lodz_pl/EgfWPA8zoqlGsssFaSFrYbQB8BURhlZFcAy2ADiBs-YzHQ?e=zqWonY|Download]] | +
-| PELCRA_YT_2 |Second part of Polish YouTubers' videos. |  23  |  45  |  49,000  |  05:10  |  05:46  | [[https://uniwersytetlodzki-my.sharepoint.com/:f:/g/personal/pelcra_uni_lodz_pl/El5AGQYNbi9PmaFFFC91nBkBHLBJ-xhS6hdOF4vzJ-3ocQ?e=Rjtn7U|Download]] +
-|  MMW_1  |A corpus of Polish conversations recorded in Wrocław in the 1980s|  14  |  65  |  60,000  |  7:02  |  8:33  | [[https://uniwersytetlodzki-my.sharepoint.com/:f:/g/personal/pelcra_uni_lodz_pl/EiZ4mWHrcaVAoxiw_hPyPgMBYmjh7DeT9i0_uHJk0ylL4Q?e=1RQRDz|Download]] | +
-|  MMW_2  |Second part of the conversations recorded in Wrocław in the 1980s|  14  |  38  |  70,000  |  7:31  |  7:50  | [[https://uniwersytetlodzki-my.sharepoint.com/:f:/g/personal/pelcra_uni_lodz_pl/EtL-8yHklwJArXYDV32qaOUBnKDm6JVq-asjBfxuYakC2g?e=8KgWjM|Download]] | +
-|  MMK  |A corpus of Polish conversations recorded in Kraków in the 1980s. |  4  |  11  |  15,900  |  1:46  |  1:49  | [[https://uniwersytetlodzki-my.sharepoint.com/:f:/g/personal/pelcra_uni_lodz_pl/EvGqmvgV8KlCp4J30OtSNgQBWX08KGm02-2Yzof3Kq5D6w|Download]] | +
-| PELCRA_IDIO |A corpus of open interviews in Polish. |  146  |  148  |  327,500  |  :  |  38:51  | To be announced.| +
-|  |  **total**|  357  |  820  |  1,220,400  |  100:47  |  149:58  | |+
  
 +^  corpus  ^  description  ^  recordings  ^  speakers  ^  word count  ^  voice activity time (hh:mm)  ^  total duration (hh:mm)  ^
 +| PELCRA_EMO |A corpus of focused interviews (people reflecting upon their emotions). |  40  |  80  |  252,000  |  26:53  |  28:12  |
 +| PELCRA_LUZ |A corpus of open interviews. |  21  |  42  |  213,000  |  20:14  |  19:58  | 
 +| PELCRA_EMI |A corpus of Polish emigrants to Scotland. |  22  |  44  |  96,000  |  09:36  |  18:07  |
 +| PELCRA_PARL |Samples of spoken parliamentary data. |  48  |  241  |  99,000  |  12:22  |  14:13  | 
 +| PELCRA_YT_1 |Samples of Polish YouTubers' videos. |  25  |  106  |  49,000  |  04:56  |  06:39  | 
 +| PELCRA_YT_2 |Second part of Polish YouTubers' videos. |  23  |  45  |  49,000  |  05:10  |  05:46  |
 +| MMW_1  |A corpus of Polish conversations recorded in Wrocław in the 1980s. |  14  |  65  |  60,000 | 7:02  |  8:33  |
 +| MMW_2  |Second part of the conversations recorded in Wrocław in the 1980s. |  14  |  38  |  70,000 | 7:31  |  7:50  |
 +| MMK  |A corpus of Polish conversations recorded in Kraków in the 1980s. |  4  |  11  |  15,900  |  1:46  |  1:49  |
 +| PELCRA_IDIO |A corpus of open interviews in Polish. |  146  |  148  |  327,500  |  :  |  38:51  | 
 +|  |  **TOTAL**|  357  |  820  |  1,220,400  |  100:47  |  149:58  |
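
The totals row can be cross-checked against the per-corpus rows with a few lines of Python. The sketch below is a minimal example restricted to the recordings, speakers, and total duration columns; the values are copied from the table, while the helper names `to_minutes` and `to_hhmm` are illustrative.

```python
# (recordings, speakers, total duration hh:mm) per corpus, copied from the table.
CORPORA = {
    "PELCRA_EMO": (40, 80, "28:12"),
    "PELCRA_LUZ": (21, 42, "19:58"),
    "PELCRA_EMI": (22, 44, "18:07"),
    "PELCRA_PARL": (48, 241, "14:13"),
    "PELCRA_YT_1": (25, 106, "06:39"),
    "PELCRA_YT_2": (23, 45, "05:46"),
    "MMW_1": (14, 65, "8:33"),
    "MMW_2": (14, 38, "7:50"),
    "MMK": (4, 11, "1:49"),
    "PELCRA_IDIO": (146, 148, "38:51"),
}


def to_minutes(hhmm):
    """Convert an 'hh:mm' string to a number of minutes."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def to_hhmm(total_minutes):
    """Format a number of minutes as 'hh:mm' (hours are not capped at 24)."""
    return f"{total_minutes // 60}:{total_minutes % 60:02d}"


total_recordings = sum(row[0] for row in CORPORA.values())
total_speakers = sum(row[1] for row in CORPORA.values())
total_duration = to_hhmm(sum(to_minutes(row[2]) for row in CORPORA.values()))

print(total_recordings, total_speakers, total_duration)  # 357 820 149:58
```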
  
    
spoken_offline_corpora.1658916294.txt.gz · Last modified: 2022/07/27 12:04 by pezik