Task 2: Gender-inclusive LLMs for Polish
Task repository
https://github.com/poleval/2025-gender-inclusive-llms
Introduction
Polish is a grammatical gender language in which all nouns inherently encode gender markers as an integral part of the grammatical system. For example, śliwka [a plum] is feminine, jabłko [an apple] is neutral, whereas pomidor [a tomato] is masculine. All adjectives and verbs associated with a noun must match the noun’s grammatical gender. Additionally, personal nouns have distinct feminine, e.g., nauczycielka [a teacherfem] and masculine forms, e.g., nauczyciel [a teachermasc]. While feminine personal nouns typically denote female individuals or groups of females, masculine personal nouns refer not only to male individuals or male groups but also to mixed-gender groups and even females, a phenomenon known as the generic masculine, e.g., niemiecka polityk Ursula von der Leyen [Germanfem politicianmasc Ursulafem von der Leyen].
Although the grammatical system of Polish allows for naming individuals according to their natural gender (i.e., female or male), standard Polish remains heavily masculine-centric. This is reflected in a strong dominance of masculine expressions over feminine ones, which may be interpreted as reinforcing gender bias and exclusion.
One social consequence of this linguistic system is that current large language models (LLMs) trained on Polish texts inherit and reinforce masculine bias, generating gender-imbalanced outputs. As LLMs become increasingly integrated into communication, translation, and content generation systems, ensuring their outputs reflect gender inclusivity is crucial, particularly in gender-rich languages like Polish.
The dominance of masculine expressions over feminine ones in a language is a form of gender discrimination (GEC, GNL-EU). Acknowledging the harmful effects of sexist language, the Council of Europe encourages its member states to eliminate sexism from language and to adopt practices that support gender equality. In line with this recommendation, we introduce a task focused on developing gender-inclusive LLMs for Polish.
Task description
Task objective
The task aims to raise community awareness of gender inequalities in Polish and to develop LLMs capable of generating grammatically correct and gender-aware language across various contexts. Participants are challenged to embed gender inclusivity as a core feature of LLMs, offering a solution to mitigating gender bias in Polish language generation. By advancing gender-inclusive LLMs, this shared task contributes to broader efforts to promote gender equality through language, highlighting the potential of AI to facilitate more inclusive and equitable communication.
Specification
Subtask: LLMs will be evaluated on the gender-inclusive proofreading task, which involves transforming a text passage written in standard Polish into its gender-inclusive version.
Subtask: LLMs will be evaluated on the gender-sensitive Polish⇄English translation task, which involves translating a text passage written in gender-inclusive Polish into English or an English text passage into a gender-inclusive Polish translation.
Data: Inclusive Polish Instruction Set (IPIS) is made available to all participants.
Working phase: Using the train and dev subsets of the IPIS dataset, participants are expected to enhance open-source LLMs to ensure gender inclusivity.
Testing phase: Using the test subset of the IPIS dataset, submitted outputs of gender-inclusive LLMs will be evaluated in the PolEval system.
System prompt: The system prompts with gender-inclusive guidelines based on Wróblewska et al. (2025) are available in the task repository. Participants are encouraged to apply the provided system prompts during training and inference.
While modifications to the system prompt and alternative uses of the IPIS dataset, such as for data augmentation, are permitted, these must align with the task requirements and principles of fair competition.
Task constraints
- Participants are permitted to use publicly available pretrained language models, including both Polish-specific and multilingual models.
- The use of proprietary or closed-source LLMs is prohibited.
- The training and development subsets of the IPIS dataset may be used freely for any task-related purpose, including but not limited to LLM instruction-tuning, fine-tuning, and data augmentation.
- Participants may use publicly accessible linguistic resources, such as Polish corpora, lexical databases, knowledge graphs, and other structured data resources.
- All external resources and models used for developing a gender-inclusive LLM must be clearly documented in the final submission, including appropriate bibliographic references and/or direct URLs.
- The use of non-public datasets, tools, or models is strictly forbidden.
- It is prohibited to input any portion of the IPIS dataset — whether training or development instances — into proprietary LLMs (e.g. ChatGPT, Claude) for any reason, including data augmentation.
- Each team is allowed to submit a maximum of three runs per task.
- Participants are expected to prepare a short article, describing their solution with enough details to allow replication of the research.
IPIS dataset
Inclusive Polish Instruction Set (Wróblewska and Żuk, 2025) is a collection of instructions designed to improve the gender sensitivity and inclusiveness of LLMs in the Polish language scenario. The IPIS dataset is built on a gender-inclusive text corpus manually annotated in the PLLuM project.
IPIS format
Subtask
Each IPIS-proofreading sample consists of three main components:
- user prompt (
prompt
) — a specification of the given task, - input text passage (
source
) — a text passage requiring a gender-inclusive proofreading, - desired output (
target
) — the expected response corresponding to the user instruction and input text passage. This serves as the ground truth for evaluating and optimising LLM’s predictions.
{'source_resource_id': 'nlprepl-nkjp1m-dev',
'ipis_id': 'IPIS_proofreading_dev_2143',
'prompt': 'Przeredaguj tekst w standardowym języku polskim, aby nie zawierał treści krzywdzących i wykluczających. Tekst źródłowy: ',
'source': 'W 24-osobowym składzie sędziowskim znalazło się 8 Polaków, pianistów i pedagogów. W przypadku własnych uczniów nie będą mieli prawa głosu. Pojawią się też m.in. laureat nagrody za najlepsze nagranie Roku Chopinowskiego - Emanuel Ax i rosyjska pianistka reprezentująca USA - Bella Davidovich.',
'target': 'W 24-osobowym składzie sędziowskim znalazło się 8 Pol*aków/ek, pianist*ów/ek i pedago*gów/żek. W przypadku własnych ucz*niów/ennic nie będą mi*eli/ały prawa głosu. Pojawią się też m.in. laureat nagrody za najlepsze nagranie Roku Chopinowskiego - Emanuel Ax i rosyjska pianistka reprezentująca USA - Bella Davidovich.',
'messages': [{'content': 'Przeredaguj tekst w standardowym języku polskim, aby nie zawierał treści krzywdzących i wykluczających. Tekst źródłowy: W 24-osobowym składzie sędziowskim znalazło się 8 Polaków, pianistów i pedagogów. W przypadku własnych uczniów nie będą mieli prawa głosu. Pojawią się też m.in. laureat nagrody za najlepsze nagranie Roku Chopinowskiego - Emanuel Ax i rosyjska pianistka reprezentująca USA - Bella Davidovich.', 'role': 'user'}, {'content': 'W 24-osobowym składzie sędziowskim znalazło się 8 Pol*aków/ek, pianist*ów/ek i pedago*gów/żek. W przypadku własnych ucz*niów/ennic nie będą mi*eli/ały prawa głosu. Pojawią się też m.in. laureat nagrody za najlepsze nagranie Roku Chopinowskiego - Emanuel Ax i rosyjska pianistka reprezentująca USA - Bella Davidovich.', 'role': 'assistant'}],
'normalised_target': null}
Subtask
Each IPIS-translation sample consists of three main components and lanugage specifications:
- user prompt (
prompt
) — a specification of the given task, - input text passage (
source
) — a text passage to translate, - desired output (
target
) — an expected translation in standard English or gender-inclusive Polish. This serves as the ground truth for evaluating and optimising LLM’s predictions, prompt_language
— the language ofprompt
(either EN or PL)source_language
— the language of a passage to translate, either inclusive Polish (PL) or standard English (EN)target_language
— the language of a reference translation, either standard English (EN) or gender-inclusive Polish (PL)
{"source_resource_id": "EU_Karta_Praw_Podstawowych",
"ipis_id": "IPIS_translation_dev_117",
"prompt": "Translate into inclusive Polish. Text to translate: ",
"source": "Article 28\nRight of collective bargaining and action\nWorkers and employers, or their respective organisations, have, in accordance with Union law and national laws and practices, the right to negotiate and conclude collective agreements at the appropriate levels and, in cases of conflicts of interest, to take collective action to defend their interests, including strike action.",
"target": "Artykuł 28\nPrawo do rokowań i działań zbiorowych\nPracownic*y/e i pracodaw*cy/czynie, lub ich odpowiednie organizacje, mają, zgodnie z prawem Unii oraz ustawodawstwami i praktykami krajowymi, prawo do negocjowania i zawierania układów zbiorowych pracy na odpowiednich poziomach oraz do podejmowania, w przypadkach konfliktu interesów, działań zbiorowych, w tym strajku, w obronie swoich interesów.",
"prompt_language": "EN",
"source_language": "EN",
"target_language": "PL",
"messages": [{"content": "Translate into inclusive Polish. Text to translate: Article 28\nRight of collective bargaining and action\nWorkers and employers, or their respective organisations, have, in accordance with Union law and national laws and practices, the right to negotiate and conclude collective agreements at the appropriate levels and, in cases of conflicts of interest, to take collective action to defend their interests, including strike action.", "role": "user"}, {"content": "Artykuł 28\nPrawo do rokowań i działań zbiorowych\nPracownic*y/e i pracodaw*cy/czynie, lub ich odpowiednie organizacje, mają, zgodnie z prawem Unii oraz ustawodawstwami i praktykami krajowymi, prawo do negocjowania i zawierania układów zbiorowych pracy na odpowiednich poziomach oraz do podejmowania, w przypadkach konfliktu interesów, działań zbiorowych, w tym strajku, w obronie swoich interesów.", "role": "assistant"}]}
IPIS size
Subtask
The gender-inclusive proofreading test, dev and train subsets consist of 5278, 2732 and 23,532 instances, respectively. All IPIS test, dev and train subsets are balanced for the ratio of gender-inclusive transformations.
Subtask
The gender-sensitive translation test, dev and train subsets consist of 456, 304 and 1728 instances, respectively.
Evaluation
Methodology
Subtask
To evaluate the ability of LLMs to process and generate gender-inclusive Polish in the Polish⇄English translation scenario, their outcomes are compared against gold standard test instances and ranked using the primary metric:
- chrF (Popović, 2015)
The translation quality is additionally evaluated with two secondary metrics:
- chrF++
- BLEU (Papieni et al., 2002).
Subtask
To evaluate the ability of LLMs to generate gender-inclusive language, their outcomes are compared against gold standard test instances. The normalised LLM-generated texts are evaluated with the primary metric:
- F1-measure (see Normalisation procedure for details how to normalise LLMs’ outputs).
The textual quality of LLM-proofread passages is additionally evaluated using automatic secondary metrics:
- chrF and chrF++ (Popović, 2015)
- BLEU (Papieni et al., 2002).
Normalisation procedure
Various gender-inclusive alternatives are possible, e.g. for posłowie [deputies]:
posłanki i posłowie
posłowie i posłanki
posłowie/posłanki
posłanki/posłowie
posł*owie/anki
For evaluation, the gender-inclusive generated_target
samples should be normalised. The normalisation process consists in expanding all occurrences of gender-inclusive expressions, especially those containing slashes or gender stars (asterisks), into their full masculine and feminine forms and then filtering out predefined stop words (i.e. punctuation marks, subordinating and coordinating conjunctions). For normalisation steps, we use Lambo (Przybyła, 2022) for tokenisation and Combo (Klimaszewski and Wróblewska, 2021) for part-of-speech tagging.
A Python script for normalisation is included in the task repository.
Baseline
The baseline corresponds to the best off-the-shelf LLM (-base). The state of the art corresponds to the best LLM instruction-tuned on the IPIS train subset (-ipis) and possibly guided by a system prompt in Polish (-pl) or English (-en). We have tested multilingual models Llama-8B, Mistral-7B and Mistral-Nemo, and Polish-specific models Bielik-7B, Llama-PLLuM-8B, Bielik-11B and PLLuM-12B.
Subtask
LLM | F1 | chrF | chrF++ | BLEU | |
---|---|---|---|---|---|
Baseline | LLM-base | ||||
SOTA | LLM-tuned |
Subtask
LLM | chrF | chrF++ | BLEU | |
---|---|---|---|---|
Baseline | LLM-base | |||
SOTA | LLM-tuned |
Submission
Subtask
Test instances provided to participants in the testing phase will have the following structure:
{'source_resource_id': 'example-test',
'ipis_id': 'IPIS_proofreading_test_X',
'prompt': 'Przeredaguj tekst napisany w standardowym języku polskim, aby nie zawierał treści krzywdzących i wykluczających. Tekst źródłowy: ',
'source': 'W 24-osobowym składzie sędziowskim znalazło się 8 Polaków, pianistów i pedagogów. W przypadku własnych uczniów nie będą mieli prawa głosu. Pojawią się też m.in. laureat nagrody za najlepsze nagranie Roku Chopinowskiego - Emanuel Ax i rosyjska pianistka reprezentująca USA - Bella Davidovich.',
'target': null,
'normalised_target': null}
JSON test instances submitted by participants must include the following fields:
ipis_id
with the original IPIS IDgenerated_target
with a text passage proofread by the participating LLMnormalised_target
containing its normalised form
{'ipis_id': 'IPIS_proofreading_test_X',
'generated_target': 'W 24-osobowym składzie sędziowskim znalazło się 8 Pol*aków/ek, pianist*ów/ek i pedago*gów/żek. W przypadku własnych ucz*niów/ennic nie będą mi*eli/ały prawa głosu. Pojawią się też m.in. laureat nagrody za najlepsze nagranie Roku Chopinowskiego - Emanuel Ax i rosyjska pianistka reprezentująca USA - Bella Davidovich.',
'normalised_target': ['w', '24-osobowym', 'składzie', 'sędziowskim', 'znalazło', 'się', '8', 'polaków', 'polek', 'pianistów', 'pianistek', 'pedagogów', 'pedagożek', 'w', 'przypadku', 'własnych', 'uczniów', 'uczennic', 'nie', 'będą', 'mieli', 'miały', 'prawa', 'głosu', 'pojawią', 'się', 'też', 'm.in', 'laureat', 'nagrody', 'za', 'najlepsze', 'nagranie', 'roku', 'chopinowskiego', 'emanuel', 'ax', 'rosyjska', 'pianistka', 'reprezentująca', 'usa', 'bella', 'davidovich']}
For normalisation, we recommend using a Python normalisation script included in the task repository.
Subtask
Test instances provided to participants in the testing phase will have the following structure:
{'source_resource_id': 'example-test',
'ipis_id': 'IPIS_translation_test_X',
'prompt': 'Translate the text from Polish to English:',
'source': 'W 24-osobowym składzie sędziowskim znalazło się 8 Pol*aków/ek, pianist*ów/ek i pedago*gów/żek. W przypadku własnych ucz*niów/ennic nie będą mi*eli/ały prawa głosu. Pojawią się też m.in. laureat nagrody za najlepsze nagranie Roku Chopinowskiego - Emanuel Ax i rosyjska pianistka reprezentująca USA - Bella Davidovich.',
'target': null,
'prompt_language': 'EN',
'source_language': 'PL',
'target_language': 'EN'}
JSON test instances submitted by participants must include the following fields:
ipis_id
with the original IPIS IDgenerated_target
with a text passage translated by the participating LLM
{'ipis_id': 'IPIS_translation_test_X',
'generated_target': 'Among the 24-member jury are 8 Poles — pianists and educators. They will not be allowed to vote for their students. There are also, among others, Emanuel Ax, the winner of the award for the best recording of the Chopin Year, and Bella Davidovich, a Russian-born pianist representing the USA.'}
Submission components for both subtasks
- A
.jsonl
file containing a list of instances, each formatted as a JSON object. - A link to an LLM participating in the shared task.
- Names, emails and institutional affiliations of all team members.
References
- Alina Wróblewska and Bartosz Żuk (2025) IPIS: Inclusive Polish Instruction Set [to be published after the PolEval 2025 ends].
- Alina Wróblewska, Martyna Lewandowska, Aleksandra Tomaszewska, Karol Saputa and Maciej Ogrodniczuk (2025) Koncepcja form równościowych z asteryskiem inkluzywnym, Język Polski [to be published].