Task 2: Gender-inclusive LLMs for Polish

Task repository

https://github.com/poleval/2025-gender-inclusive-llms

Introduction

Polish is a grammatical gender language in which all nouns inherently encode gender markers as an integral part of the grammatical system. For example, śliwka [a plum] is feminine, jabłko [an apple] is neutral, whereas pomidor [a tomato] is masculine. All adjectives and verbs associated with a noun must match the noun's grammatical gender. Additionally, personal nouns have distinct feminine, e.g., nauczycielka [a teacherfem] and masculine forms, e.g., nauczyciel [a teachermasc]. While feminine personal nouns typically denote female individuals or groups of females, masculine personal nouns refer not only to male individuals or male groups but also to mixed-gender groups and even females, a phenomenon known as the generic masculine, e.g., niemiecka polityk Ursula von der Leyen [Germanfem politicianmasc Ursulafem von der Leyen].

Although the grammatical system of Polish allows for naming individuals according to their natural gender (i.e., female or male), standard Polish remains heavily masculine-centric. This is reflected in a strong dominance of masculine expressions over feminine ones, which may be interpreted as reinforcing gender bias and exclusion.

One social consequence of this linguistic system is that current large language models (LLMs) trained on Polish texts inherit and reinforce masculine bias, generating gender-imbalanced outputs. As LLMs become increasingly integrated into communication, translation, and content generation systems, ensuring their outputs reflect gender inclusivity is crucial, particularly in gender-rich languages like Polish.

The dominance of masculine expressions over feminine ones in a language is a form of gender discrimination (GEC,  GNL-EU). Acknowledging the harmful effects of sexist language, the Council of Europe encourages its member states to eliminate sexism from language and to adopt practices that support gender equality. In line with this recommendation, we introduce a task focused on developing gender-inclusive LLMs for Polish.

gender_inclusive_language

Make the world more equal! Join the gender-inclusive challenge!

Task description

Task objective

The task aims to raise community awareness of gender inequalities in Polish and to develop LLMs capable of generating grammatically correct and gender-aware language across various contexts. Participants are challenged to embed gender inclusivity as a core feature of LLMs, offering a solution to mitigating gender bias in Polish language generation. By advancing gender-inclusive LLMs, this shared task contributes to broader efforts to promote gender equality through language, highlighting the potential of AI to facilitate more inclusive and equitable communication.

Specification

  • Each submitted gender-inclusive LLM will be evaluated on two tasks:
A. Gender-inclusive proofreading — transforming a text passage written in standard Polish into its gender-inclusive version.
B. Gender-sensitive Polish⇄English translation — translating a text passage written in gender-inclusive Polish into English or an English text passage into a gender-inclusive Polish translation.
 
  • Data: The Inclusive Polish Instruction Set (IPIS) is made available to all participants.
  • Task run:

Working phase: Using the train and dev subsets of the IPIS dataset, participants are expected to enhance an open-source LLM to ensure gender inclusivity.

Testing phase: Using the test subset of the IPIS dataset, submitted outputs of the gender-inclusive LLM will be evaluated in the PolEval system.

  • System prompt: The system prompts with gender-inclusive guidelines based on Wróblewska et al. (2025) are available in the task repository. Participants are encouraged to apply the provided system prompts during training and inference.
  • While modifications to the system prompt and alternative uses of the IPIS dataset, such as for data augmentation, are permitted, these must align with the task requirements and principles of fair competition.

Task constraints

  1. Participants are permitted to use publicly available pretrained language models, including both Polish-specific and multilingual models.
  2. The use of proprietary or closed-source LLMs is prohibited.
  3. The training and development subsets of the IPIS dataset may be used freely for any task-related purpose, including but not limited to LLM instruction-tuning, fine-tuning, and data augmentation.
  4. Participants may use publicly accessible linguistic resources, such as Polish corpora, lexical databases, knowledge graphs, and other structured data resources.
  5. All external resources and models used for developing a gender-inclusive LLM must be clearly documented in the final submission, including appropriate bibliographic references and/or direct URLs.
  6. The use of non-public datasets, tools, or models is strictly forbidden.
  7. It is prohibited to input any portion of the IPIS dataset — whether training or development instances — into proprietary LLMs (e.g. ChatGPT, Claude) for any reason, including data augmentation.
  8. Each team is allowed to submit a maximum of three runs per task.
  9. Participants are expected to prepare a short article, describing their solution with enough details to allow replication of the research.

IPIS dataset

Inclusive Polish Instruction Set (Wróblewska and Żuk, 2025) is a collection of instructions designed to improve the gender sensitivity and inclusiveness of LLMs in the Polish language scenario. The IPIS dataset is built on a gender-inclusive text corpus manually annotated in the PLLuM project.

IPIS format

A Subtask

Each IPIS-proofreading sample consists of three main components:

  1. user prompt (prompt) — a specification of the given task,
  2. input text passage (source) — a text passage requiring a gender-inclusive proofreading,
  3. desired output (target) — the expected response corresponding to the user instruction and input text passage. This serves as the ground truth for evaluating and optimising LLM's predictions.

B Subtask

Each IPIS-translation sample consists of three main components and lanugage specifications:

  1. user prompt (prompt) — a specification of the given task,
  2. input text passage (source) — a text passage to translate,
  3. desired output (target) — an expected translation in standard English or gender-inclusive Polish. This serves as the ground truth for evaluating and optimising LLM's predictions,
  4. prompt_language — the language of prompt (either EN or PL)
  5. source_language — the language of a passage to translate, either inclusive Polish (PL) or standard English (EN)
  6. target_language — the language of a reference translation, either standard English (EN) or gender-inclusive Polish (PL)

IPIS size

A Subtask

The gender-inclusive proofreading testdev and train subsets consist of 5278, 2732 and 23,532 instances, respectively. All IPIS test, dev and train subsets are balanced for the ratio of gender-inclusive transformations.

B Subtask

The gender-sensitive translation test and train subsets consist of 760 and 1728 instances, respectively.

Evaluation

Methodology

B Subtask

To evaluate the ability of the gender-inclusive LLM to process and generate gender-inclusive Polish in the Polish⇄English translation scenario, its outcomes are compared against gold standard test instances and ranked using the primary metric:

The translation quality is additionally evaluated with two secondary metrics:

A Subtask

To evaluate the ability of the gender-inclusive LLM to generate gender-inclusive language, its outcomes are compared against gold standard test instances. The normalised LLM-generated texts are evaluated with the primary metric:

The textual quality of LLM-proofread passages is additionally evaluated using automatic secondary metrics:

Normalisation procedure

Various gender-inclusive alternatives are possible, e.g. for posłowie [deputies]:

posłanki i posłowie
posłowie i posłanki
posłowie/posłanki
posłanki/posłowie
posł*owie/anki

For evaluation, the gender-inclusive generated_target samples should be normalised. The normalisation process consists in expanding all occurrences of gender-inclusive expressions, especially those containing slashes or gender stars (asterisks), into their full masculine and feminine forms and then filtering out predefined stop words (i.e. punctuation marks, subordinating and coordinating conjunctions). For normalisation steps, we use Lambo (Przybyła, 2022) for tokenisation and Combo (Klimaszewski and Wróblewska, 2021) for part-of-speech tagging.

A Python script for normalisation is included in the task repository.

Baseline

The baseline corresponds to the best off-the-shelf LLM (-base). The state of the art corresponds to the best LLM instruction-tuned on the IPIS train subset (-ipis) and possibly guided by a system prompt in Polish (-pl) or English (-en). We have tested multilingual models Llama-8BMistral-7B and Mistral-Nemo, and Polish-specific models Bielik-7BLlama-PLLuM-8BBielik-11B and PLLuM-12B.

A Subtask

Baseline: PLLuM-12B-default-pl

  • precision: 2.56
  • recall: 6.28
  • F1: 3.64
  • BLEU: 63.59
  • chrF: 82.74

SOTA: Bielik-11B-tuned

  • precision: 63.93
  • recall: 56.26
  • F1: 59.85
  • BLEU: 95.22
  • chrF: 97.81

B Subtask

Baseline PL-to-EN: Mistral-Nemo-12B-default

  • BLEU: 53.68
  • chrF: 75.35

SOTA PL-to-EN: Bielik-12B-tuned-en

  • BLEU: 57.55
  • chrF: 78.03

Baseline EN-to-PL: Bielik-12B-default

  • BLEU: 41.49
  • chrF: 71.78

SOTA EN-to-PL: Bielik-11B-fewshot-en

  • BLEU: 43.02
  • chrF: 72.46

Submission

Submission format

Examples of the required submission format can be found in the Task repository.

Evaluation platform

 

Submission components for both subtasks

  1. A file containing a list of instances, each formatted as a JSON object. 
  2. A link to the gender-inclusive LLM participating in the shared task.
  3. Names, emails and institutional affiliations of all team members.

References