=====The PLLuM Instruction Corpus===== ====Description==== We release the first representative subset of the PLLuM Instruction Corpus (PLLuMIC), which we believe to be useful in guiding and planning the development of similar LLM datasets. PLLuMIC is a hand-crafted set of LLM fine-tuning Polish language instructions, developed in line with the annotation guidelines and covering a functional typology. The corpus is described in more detail in a forthcoming paper titled //The PLLuM Instruction Corpus// (Pęzik et al. 2025). We plan regular updates and significant extensions of the corpus. ---- ====Statistics==== ===Total number of instructions=== * 1278 ===Type distribution=== * Adversarial: 125 * CoT: 50 * Data manipulation: 88 * Dialogue: 124 * Extraction: 71 * Formatting: 87 * Generation: 392 * Identity: 68 * Knowledge (QA): 80 * NLP: 102 * Programming: 30 * Translation: 61 ===Thematic distribution=== * Art: 14 * Astronomy: 5 * Automotive: 6 * Biology: 78 * Chemistry: 7 * Computer science: 163 * Culinary: 52 * Culture: 55 * Ecology: 4 * Economy: 19 * Entertainment: 85 * Geography: 59 * History: 48 * Home: 60 * Hobby: 4 * Industry: 20 * Languages: 185 * Law and administration: 31 * Literature: 50 * Mathematics: 15 * Medicine: 36 * Other: 73 * Philosophy: 5 * Physics: 8 * Politics: 42 * Psychology: 19 * Religion: 7 * Society: 169 * Sports: 26 * Technology: 87 * Travel: 25 ---- ====Apply for access==== To gain access to the PLLuMIC dataset, which was produced as a result of the CLARIN-BIZ project, we kindly ask you to take a moment to complete the form provided below. Your cooperation in this process is greatly appreciated, and we thank you for your interest in our work. [[https://forms.office.com/e/k1dRxBtaZT|PLLuMIC access form]] ---- ====Dataset file explanation==== The PLLuMIC dataset is distributed as a JSON file storing a list of conversations between a user and an AI assistant. Each conversation is also a JSON file described by following fields: ===Top-Level Fields === * **dataset_name**: Name of the dataset (''PLLuMIC''). * **dataset_source**: Source organization (''CLARIN-BIZ-bis''). * **conv_id**: Unique identifier for the conversation (''3242183cbce2''). * **messages**: Array of dialogue messages (user/assistant/system exchanges). ===Message Object Fields=== Each entry in ''messages'' contains: * **instruction_id**: Unique ID for the instruction/task (''2a07c2eca0cb''). * **seq**: Sequence number (''-1'' for system, ''0,1,2,...'' for user/assistant turns). * **role**: Speaker role (''system'', ''user'', or ''assistant''). * **content**: Text of the message (empty for some system prompts). * **type**: Interaction type (e.g., ''Dialog'', ''Generation''). * **subtype**: Task subtype (e.g., ''System prompt'', ''Text simplification''). * **topic**: Relevant topics (e.g., ''Geography''). * **language**: Language code (e.g., ''pol'' for Polish). * **source**: References (e.g., Wikipedia URLs). ---- ===Disclaimer=== Please do not redistribute.