PELCRA Spoken Offline Corpora

The PELCRA Spoken Offline Corpora of conversational Polish, collected as part of the CLARIN-PL project, are available via the links in the table below.

Each corpus consists of speech recordings (in WAV format) and word-by-word transcriptions, which also include some non-speech events. The transcriptions (in EAF format) are complemented with word- and phone-level annotations (_out.eaf files) and, where available, with video content (MP4 format) and PDF transcripts.
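EAF is the XML-based annotation format used by ELAN, so the transcriptions can be read with a standard XML parser. The following is a minimal sketch, not the project's own tooling: it collects the time-aligned annotations from every tier of an EAF file. The function name `read_eaf_annotations` is hypothetical, and the tier names actually used in the PELCRA files are not assumed here.

```python
import xml.etree.ElementTree as ET


def read_eaf_annotations(source):
    """Return (tier_id, start_ms, end_ms, text) tuples from an EAF file.

    `source` may be a file path or an open file object.
    Only time-aligned annotations (ALIGNABLE_ANNOTATION) are collected.
    """
    root = ET.parse(source).getroot()

    # TIME_ORDER maps symbolic time-slot ids to millisecond offsets.
    slots = {
        slot.get("TIME_SLOT_ID"): int(slot.get("TIME_VALUE"))
        for slot in root.iter("TIME_SLOT")
        if slot.get("TIME_VALUE") is not None
    }

    rows = []
    for tier in root.iter("TIER"):
        tier_id = tier.get("TIER_ID")
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            start = slots.get(ann.get("TIME_SLOT_REF1"))
            end = slots.get(ann.get("TIME_SLOT_REF2"))
            text = (ann.findtext("ANNOTATION_VALUE") or "").strip()
            rows.append((tier_id, start, end, text))
    return rows
```

Calling `read_eaf_annotations("recording_out.eaf")` (a hypothetical file name) would return one tuple per annotation, which can then be filtered by tier to separate utterance, word, and phone layers if the file keeps them on separate tiers.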

Metadata are provided in XML files that list information about the recordings (titles, topics, dates, and URLs), the available media (audio, video, PDF), and annotation details (file, date, annotator, place, and duration, plus information about the speakers where available). Each corpus includes a Document Type Definition (DTD) specifying the structure of the elements and attributes of the XML documents. An SQLite database with the metadata for all the corpora is also available for download.
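Because the schema of the SQLite metadata database is not described here, a reasonable first step is to inspect it. The sketch below lists every table and its columns using only Python's standard `sqlite3` module. The function name `list_tables` and the file name in the usage note are hypothetical.

```python
import sqlite3
from contextlib import closing


def list_tables(db_path):
    """Return {table_name: [column names]} for every table in the database."""
    with closing(sqlite3.connect(db_path)) as conn:
        # sqlite_master is SQLite's built-in catalogue of schema objects.
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master "
                "WHERE type = 'table' ORDER BY name"
            )
        ]
        # PRAGMA table_info returns one row per column; index 1 is its name.
        return {
            name: [col[1] for col in conn.execute(f'PRAGMA table_info("{name}")')]
            for name in names
        }
```

For example, `list_tables("pelcra_metadata.db")` (an assumed file name) shows which tables and columns exist before any queries are written against them.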

Most of these offline corpora are indexed in Spokes and Spokes2.

Together, the PELCRA spoken offline corpora contain about 928,500 words.

| corpus | description | recordings | speakers | word count | voice activity time (hh:mm) | total duration (hh:mm) | link |
|---|---|---|---|---|---|---|---|
| PELCRA_EMO | A corpus of focused interviews (people reflecting upon their emotions). | 40 | 80 | 252,000 | 26:53 | 28:12 | Download |
| PELCRA_LUZ | A corpus of open interviews. | 21 | 42 | 213,000 | 20:14 | 19:58 | Download |
| PELCRA_EMI | A corpus of Polish emigrants to Scotland. | 22 | 44 | 96,000 | 09:36 | 18:07 | Download |
| PELCRA_PARL | Samples of spoken parliamentary data. | 48 | 241 | 99,000 | 12:22 | 14:13 | Download |
| PELCRA_YT_1 | Samples of Polish YouTubers' videos. | 25 | 106 | 49,000 | 04:56 | 06:39 | Download |
| PELCRA_YT_2 | Second part of Polish YouTubers' videos. | 23 | 45 | 49,000 | 05:10 | 05:46 | Download |
| MOWA_MIAST | A corpus of Polish conversations recorded in the 1980s. | 28 | 103 | 130,000 | 14:33 | 16:23 | Download |
| PELCRA_IDIO | A corpus of open interviews in Polish. | 21 | 42 | 51,500 | 05:17 | 05:58 | To be announced. |
| total | | 228 | 703 | 928,500 | 99:01 | 115:16 | |
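The time columns use an hh:mm format, so they cannot be summed directly as numbers. As a small illustration of how such values can be aggregated, the sketch below converts each entry to minutes, sums them, and converts back. The per-corpus values are copied from the table; the helper names `to_minutes` and `to_hhmm` are illustrative only.

```python
def to_minutes(hhmm):
    """Convert an 'hh:mm' string to a number of minutes."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def to_hhmm(total_minutes):
    """Convert a number of minutes back to an 'hh:mm' string."""
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours:02d}:{minutes:02d}"


# Per-corpus values in table order (PELCRA_EMO ... PELCRA_IDIO).
voice_activity = ["26:53", "20:14", "09:36", "12:22",
                  "04:56", "05:10", "14:33", "05:17"]
total_duration = ["28:12", "19:58", "18:07", "14:13",
                  "06:39", "05:46", "16:23", "05:58"]

print(to_hhmm(sum(map(to_minutes, voice_activity))))  # 99:01
print(to_hhmm(sum(map(to_minutes, total_duration))))  # 115:16
```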

The following paper should be cited to fulfill the CC attribution condition of the license for these resources:

spoken_offline_corpora.txt · Last modified: 2020/05/15 10:29 by pezik