The PLLuM Instruction Corpus

Description

We release the first representative subset of the PLLuM Instruction Corpus (PLLuMIC), which we believe will be useful in guiding and planning the development of similar LLM datasets. PLLuMIC is a hand-crafted set of Polish-language instructions for LLM fine-tuning, developed in line with the annotation guidelines and covering a functional typology. The corpus is described in more detail in a forthcoming paper titled "The PLLuM Instruction Corpus" (Pęzik et al. 2025). We plan regular updates and significant extensions of the corpus.


Statistics

Total number of instructions

Type distribution

Thematic distribution


Apply for access

To gain access to the PLLuMIC dataset, which was produced as a result of the CLARIN-BIZ project, please complete the form below. Thank you for your interest in our work.

PLLuMIC access form


Dataset file explanation

The PLLuMIC dataset is distributed as a JSON file storing a list of conversations between a user and an AI assistant. Each conversation is itself a JSON object described by the following fields:

Top-Level Fields

Message Object Fields

Each entry in messages contains:
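
As a quick way to inspect the file layout described above, the following is a minimal sketch in Python. It assumes only what this section states: the file holds a JSON list of conversations, and each conversation carries a `messages` list. The function names `load_conversations` and `summarize` and the file name in the usage comment are hypothetical, chosen for illustration. The sketch tallies whichever field names it finds rather than assuming any particular ones.

```python
import json
from collections import Counter
from pathlib import Path


def load_conversations(path):
    """Load the dataset file and return its list of conversations."""
    with Path(path).open(encoding="utf-8") as handle:
        data = json.load(handle)
    if not isinstance(data, list):
        raise ValueError("expected a JSON list of conversations")
    return data


def summarize(conversations):
    """Count conversations and messages, and tally the field names seen."""
    top_level_keys = Counter()
    message_keys = Counter()
    message_count = 0
    for conversation in conversations:
        top_level_keys.update(conversation.keys())
        # "messages" is the only field name taken from the description above.
        for message in conversation.get("messages", []):
            message_count += 1
            message_keys.update(message.keys())
    return {
        "conversations": len(conversations),
        "messages": message_count,
        "top_level_keys": dict(top_level_keys),
        "message_keys": dict(message_keys),
    }


# Usage (the file name is a placeholder, not the actual distribution name):
# conversations = load_conversations("pllumic.json")
# print(summarize(conversations))
```

Tallying the observed keys keeps the sketch independent of the exact schema, so it can serve as a first sanity check after download.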


Disclaimer

Please do not redistribute.