Task 1: ŚMIGIEL
Spotting Machine-Generated Text from LLMs for Polish
Task repository
https://github.com/poleval/2025-smigiel
Introduction
*śmigiel ['ɕmiɡjɛl] m I, D. ~gla gw. «płaski szczebel w drabinach wozu» (Eng. flat rung in a ladder wagon)
SJP W. Doroszewskiego*
The rapid progress of large language models (LLMs) in recent years has enabled the generation of highly fluent and linguistically correct texts in numerous languages. Although these models demonstrate strong performance in natural language processing (NLP) and natural language generation (NLG) tasks, their reliance on human-authored data and their capacity to emulate human writing styles raise critical concerns regarding authenticity, authorship attribution, and the potential for misuse. In response to these challenges, there is a growing research focus on developing systems for machine-generated text detection — tools designed to distinguish between human-authored and AI-generated content.
With the ongoing development of systems for machine-generated text (MGT) detection, the need for their evaluation and benchmarking has become increasingly important. This challenge has already given rise to several shared tasks, e.g. GenAI Content Detection at COLING 2025, SemEval-2024 Task 8, and PAN 2025 Task 1. However, the majority of these initiatives have focused primarily on English texts. An important exception, and a source of inspiration for our task, is IberAuTexTification, which targets the languages of the Iberian Peninsula. The question of how difficult MGT detection is for Polish remains open.
Task description
Here, we outline ŚMIGIEL: a shared task on Spotting Machine-Generated Text from LLMs for Polish, organised at the PolEval 2025 evaluation campaign.
Objective
The main objective of this shared task is to benchmark and enhance the state-of-the-art in detecting machine-generated texts in the Polish language across various domains and textual genres. Robust and reliable MGT detection systems will undoubtedly contribute to the broader goals of responsible AI development, supporting critical areas such as media verification, academic and journalistic integrity, and, potentially, digital forensics.
Procedure
The task is framed as a binary classification problem: distinguishing between human-authored and machine-generated texts. Participating systems will be presented with a collection of text fragments and must determine whether each one was written by a human or generated automatically by a large language model (LLM). To promote the development of models capable of generalising across diverse writing styles, the training and test datasets come from different domains.
ŚMIGIEL subtasks
The participants can choose to submit their solutions in three subtasks:
- UNSUPERVISED: classifiers that are prepared without the use of training data,
- CONSTRAINED: classifiers that are trained only on the dataset provided by the organisers,
- OPEN: classifiers that are trained in any way, including external datasets and data augmentation.
Submissions within these subtasks will be evaluated and ranked separately.
Task constraints
- The use of publicly available, pre-trained models, both Polish-specific and multilingual, is permitted.
- Participants may use publicly accessible Polish corpora, lexical resources, knowledge bases, and other structured data resources.
- Participants are expected to prepare a short article describing their solution in enough detail to allow replication of the research.
- All external models and resources used must be listed in the submission, including bibliographic references or direct links.
- The use of proprietary or non-public datasets, models, or services is strictly prohibited.
- Each team is allowed a maximum of three submissions per subtask.
Dataset
The training data are stored in the task repository.
Content
Human-written texts originate from various sources and are available under open licenses for research purposes. The texts cover the following domains:
- customer reviews
- literature
- social media
- Wikipedia
Machine-generated texts are prepared using a range of open-source LLMs of different sizes:
- small: Llama 3.1 8B, Bielik 7B, Mistral 7B
- medium: Bielik 11B, Mistral Nemo, PLLuM 12B
- large: Gemma 3 27B
The dataset is balanced as follows:
- It contains an equal proportion of human-written and LLM-generated instances.
- Domains are uniformly represented.
- Roughly one-third of the LLM-generated texts come from each model size.
Format
Each text to be classified is placed on a separate line in the data.tsv file, while corresponding labels are stored in the labels.tsv file.
All texts, whether generated by LLMs or written by humans, are shortened to a comparable length to ensure consistency: each text is truncated once it reaches a predefined character limit.
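The line-aligned format and the truncation step can be sketched in Python. The file names `data.tsv` and `labels.tsv` follow the description above; `MAX_CHARS` is an assumed placeholder, since the actual character limit is not specified in the task description.

```python
from pathlib import Path

MAX_CHARS = 512  # assumed placeholder; the real limit is not specified here


def load_split(data_path, labels_path):
    """Read line-aligned files: line i of labels.tsv labels line i of data.tsv."""
    texts = Path(data_path).read_text(encoding="utf-8").splitlines()
    labels = [int(line) for line in Path(labels_path).read_text(encoding="utf-8").splitlines()]
    assert len(texts) == len(labels), "data.tsv and labels.tsv must be line-aligned"
    return texts, labels


def truncate(text, limit=MAX_CHARS):
    """Cut a text once it reaches the predefined character limit."""
    return text[:limit]
```

Because every text occupies exactly one line, systems should take care not to strip or reorder lines, or the text-to-label alignment breaks.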
Training examples
- #1: The book discusses the statistical revolution which took place in the twentieth century, where science shifted from a deterministic view (Clockwork universe) to a perspective concerned primarily with probabilities and distributions and parameters.
- label: 0
- #2: I don’t eat, sleep, or forget things mid-sentence (unless you want me to role-play as someone who does)
- label: 1
Test data
Test data will be similar in nature, but will also be based on models and genres unseen during training.
Evaluation
Metrics
In the ŚMIGIEL task, Accuracy is used as the primary evaluation metric. Accuracy, defined as the proportion of correctly classified instances over the total number of instances, is widely used in text classification research (Jurafsky and Martin, 2024).
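The metric itself is straightforward to compute; a minimal reference implementation over gold and predicted 0/1 labels might look like this:

```python
def accuracy(gold, pred):
    """Proportion of correctly classified instances over all instances."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted label lists must have equal length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)
```

Since the dataset is balanced between the two classes, a trivial majority-class predictor would score close to 0.5, which gives a natural lower bound for submitted systems.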
Baseline
The organisers will prepare simple baseline solutions relying on mainstream approaches to text classification.
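As an illustration of what such a mainstream baseline could look like (not the organisers' actual implementation), here is a sketch of a multinomial Naive Bayes classifier over whitespace-tokenised bags of words with add-one smoothing, written in pure Python:

```python
import math
from collections import Counter


class NaiveBayesBaseline:
    """Multinomial Naive Bayes over whitespace tokens with add-one smoothing."""

    def fit(self, texts, labels):
        self.token_counts = {0: Counter(), 1: Counter()}
        self.doc_counts = Counter(labels)
        for text, label in zip(texts, labels):
            self.token_counts[label].update(text.lower().split())
        self.vocab = set(self.token_counts[0]) | set(self.token_counts[1])
        return self

    def predict(self, text):
        tokens = text.lower().split()
        best_label, best_score = None, -math.inf
        for label in (0, 1):
            total = sum(self.token_counts[label].values())
            # Log prior from class frequencies in the training data.
            score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
            for tok in tokens:
                # Add-one smoothing; the extra +1 in the denominator also
                # covers tokens never seen in either class.
                score += math.log(
                    (self.token_counts[label][tok] + 1)
                    / (total + len(self.vocab) + 1)
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

A competitive submission would of course go well beyond this, e.g. with character n-grams, TF-IDF weighting, or a fine-tuned pre-trained transformer, but a sketch like this sets the floor that any learned system should clear.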
Literature
Jurafsky, D. & Martin, J. H. (2024). Speech and Language Processing. An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition with Language Models. 3rd edition.
Sarvazyan, A. M., González, J. Á., Franco-Salvador, M., Rangel, F., Chulvi, B., & Rosso, P. (2023). Overview of AuTexTification at IberLEF 2023: Detection and attribution of machine-generated text in multiple domains. arXiv.
Wang, Y., Mansurov, J., Ivanov, P., Su, J., Shelmanov, A., Tsvigun, A., Mohammed Afzal, O., Mahmoud, T., Puccetti, G., & Arnold, T. (2024). SemEval-2024 Task 8: Multidomain, Multimodal and Multilingual Machine-Generated Text Detection. In Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024) (pp. 2057–2079). Association for Computational Linguistics.