# Task 4: Polish Speech Emotion Recognition Challenge
## Task repository
https://github.com/poleval/2025-speech-emotion
## Introduction
Speech emotion recognition (SER) is a critical area of research due to its wide range of potential applications. Recent advances in automatic speech recognition (ASR) and large language models (LLMs) have opened new possibilities for the development of SER systems. However, because SER combines audio processing with natural language understanding, the field still faces challenges: subtle emotional cues are conveyed not only by what is said, but also by how it is said.
The difficulty of this task stems from the subjective nature of emotions, both in their expression and in their perception. Each person may experience and interpret emotions differently, depending on factors such as language, cultural background, and situational context. Moreover, even subtle variations in speech can complicate generalization and reduce robustness, especially for low-resource languages or in challenging acoustic conditions.
In order to promote research in this area, we introduce the Polish Speech Emotion Recognition Challenge. The goal of this task is to evaluate how well current systems can identify emotional states from speech across diverse conditions, languages and speakers.
## Task Definition
The goal of this task is to develop an automatic system for classifying emotions based on audio recordings of human speech. Given an audio sample as input, the system should output a single label indicating the emotional state of the speaker.
This is a multi-class classification problem, with the following six target emotional states: anger, fear, happiness, sadness, surprise, and neutral.
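To make the expected input/output contract concrete, here is a minimal sketch in Python. It is a hypothetical placeholder, not part of the official task materials: the function name, the ordering of the `EMOTIONS` list, and the majority-class heuristic are all illustrative assumptions.

```python
from pathlib import Path

# The six target emotional states defined by the task.
EMOTIONS = ["anger", "fear", "happiness", "sadness", "surprise", "neutral"]

def predict_emotion(audio_path: Path) -> str:
    """Map a single audio recording to exactly one of the six target labels.

    Placeholder logic: always predicts "neutral"; a real system would
    extract acoustic features from the recording and run a classifier.
    """
    return "neutral"
```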
Participants will receive a train set consisting of speech recordings in seven languages: Bengali, English, French, German, Italian, Russian, and Spanish. In addition, a validation set containing speech recordings in Polish will be provided for evaluation purposes.
The use of the validation set for training, whether directly or indirectly (e.g., through data augmentation), is strictly prohibited.
The test set will consist of previously unseen audio recordings of human speech in Polish. It is forbidden to manually label the test samples.
Participants may use transfer learning or pretrained models, provided that these models have not been trained or fine-tuned on Polish data or on the nEMO dataset [1].
Participants are required to work strictly within the provided dataset. The use of external resources or any additional data, including publicly available datasets, is not allowed.
## Dataset

### Dataset Format
Each set consists of:

- Audio recordings, resampled to 16 kHz and saved in FLAC format.
- A JSONL file containing the following metadata for each sample (see the loading sketch below):
  - `id`: A unique identifier for the audio sample, corresponding to the filename.
  - `emotion`: The expected label for the sample.
  - `transcription`: The textual transcription of the speech.
  - `speaker_id`: An identifier for the speaker (`None` if not available).
  - `gender`: The gender of the speaker (`None` if not available).
  - `age`: The age of the speaker (`None` if not available).
  - `dataset`: The name of the original dataset from which the sample was sourced.
  - `language`: The language spoken in the audio sample.
For the test set, the expected labels are not provided.
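A minimal sketch of loading this metadata, assuming a metadata file named `train.jsonl` (the actual filenames in the repository may differ):

```python
import json
from collections import Counter

def load_metadata(jsonl_path: str):
    """Yield one metadata dict per non-empty line of a JSONL file."""
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

# Example: inspect the per-emotion distribution of the train set.
emotion_counts = Counter(s["emotion"] for s in load_metadata("train.jsonl"))
print(emotion_counts)
```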
### Train set
The train set consists of 29 714 audio recordings from 12 different datasets: CaFE [2], CREMA-D [3], EMNS [4], Emozionalmente [5], eNTERFACE [6], JL-Corpus [7], MESD [8, 9], Oreau [10], PAVOQUE [11], RAVDESS [12], RESD [13], and SUBESCO [14].
The language and the distribution of samples per emotion in each subset are shown in the table below.
| Dataset        | Language | # samples | anger | fear  | happiness | neutral | sadness | surprise |
|----------------|----------|-----------|-------|-------|-----------|---------|---------|----------|
| CaFE           | French   | 792       | 144   | 144   | 144       | 72      | 144     | 144      |
| CREMA-D        | English  | 6 171     | 1 271 | 1 271 | 1 271     | 1 087   | 1 271   | -        |
| EMNS           | English  | 743       | 133   | -     | 158       | 149     | 150     | 153      |
| Emozionalmente | Italian  | 5 916     | 986   | 986   | 986       | 986     | 986     | 986      |
| eNTERFACE      | English  | 1 047     | 210   | 210   | 207       | -       | 210     | 210      |
| JL-Corpus      | English  | 960       | 240   | -     | 240       | 240     | 240     | -        |
| MESD           | Spanish  | 718       | 143   | 144   | 144       | 143     | 144     | -        |
| Oreau          | French   | 431       | 73    | 71    | 72        | 71      | 72      | 72       |
| PAVOQUE        | German   | 4 867     | 601   | -     | 584       | 3 126   | 556     | -        |
| RAVDESS        | English  | 1 056     | 192   | 192   | 192       | 96      | 192     | 192      |
| RESD           | Russian  | 1 013     | 219   | 223   | 218       | 191     | 162     | -        |
| SUBESCO        | Bengali  | 6 000     | 1 000 | 1 000 | 1 000     | 1 000   | 1 000   | 1 000    |
| Total          |          | 29 714    | 5 212 | 4 241 | 5 216     | 7 161   | 5 127   | 2 757    |
### Validation set
The validation set consists of 4 481 audio recordings in Polish from the nEMO dataset [1]. The distribution of samples per emotion is shown in the table below.
|           | anger | fear | happiness | neutral | sadness | surprise |
|-----------|-------|------|-----------|---------|---------|----------|
| # samples | 749   | 736  | 749       | 809     | 769     | 669      |
### Test set
TBA
## Evaluation
The solutions will be evaluated based on two metrics: accuracy and macro-averaged F1-score.

$$\text{Accuracy} = \frac{1}{N} \sum_{i=1}^{N} [\hat{y}_i = y_i]$$

$$F1_{\text{macro}} = \frac{1}{C} \sum_{c=1}^{C} \frac{2\, P_c R_c}{P_c + R_c}$$

where:

- $N$ is the number of evaluated samples, $y_i$ and $\hat{y}_i$ are the gold and predicted labels for sample $i$, and $[\cdot]$ equals 1 if the condition inside holds and 0 otherwise;
- $C$ is the number of emotion classes ($C = 6$ here), and $P_c$ and $R_c$ are the precision and recall for class $c$.
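Assuming the two metrics above, both can be computed locally with scikit-learn as in the sketch below; this is not the official evaluation script.

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy gold labels and system predictions, for illustration only.
y_true = ["anger", "anger", "fear", "fear", "neutral", "neutral"]
y_pred = ["anger", "fear", "fear", "fear", "neutral", "anger"]

print("Accuracy:", accuracy_score(y_true, y_pred))             # 4/6 ≈ 0.667
print("Macro F1:", f1_score(y_true, y_pred, average="macro"))  # ≈ 0.656
```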
## References
1. I. Christop. 2024. nEMO: Dataset of Emotional Speech in Polish.
2. P. Gournay, O. Lahaie, and R. Lefebvre. 2018. A Canadian French Emotional Speech Dataset.
3. H. Cao, D. G. Cooper, M. K. Keutmann, R. C. Gur, A. Nenkova, and R. Verma. 2014. CREMA-D: Crowd-Sourced Emotional Multimodal Actors Dataset.
4. K. A. Noriy, X. Yang, and J. J. Zhang. 2023. EMNS /Imz/ Corpus: An Emotive Single-Speaker Dataset for Narrative Storytelling in Games, Television and Graphic Novels.
5. F. Catania, J. W. Wilke, and F. Garzotto. 2020. Emozionalmente: A Crowdsourced Corpus of Simulated Emotional Speech in Italian.
6. O. Martin, I. Kotsia, B. Macq, and I. Pitas. 2006. The eNTERFACE’05 Audio-Visual Emotion Database.
7. J. James, L. Tian, and C. I. Watson. 2018. An Open Source Emotional Speech Corpus for Human Robot Interaction Applications.
8. M. M. Duville, L. M. Alonso-Valerdi, and D. I. Ibarra-Zarate. 2021. The Mexican Emotional Speech Database (MESD): Elaboration and Assessment Based on Machine Learning.
9. M. M. Duville, L. M. Alonso-Valerdi, and D. I. Ibarra-Zarate. 2021. Mexican Emotional Speech Database Based on Semantic, Frequency, Familiarity, Concreteness, and Cultural Shaping of Affective Prosody.
10. L. Kerkeni, C. Cleder, Y. Serrestou, and K. Raoof. 2020. French Emotional Speech Database - Oréau.
11. I. Steiner, M. Schröder, and A. Klepp. 2013. The PAVOQUE Corpus as a Resource for Analysis and Synthesis of Expressive Speech.
12. S. R. Livingstone and F. A. Russo. 2018. The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A Dynamic, Multimodal Set of Facial and Vocal Expressions in North American English.
13. А. Аментес, И. Лубенец, and Н. Давидчук. 2022. Open Artificial Intelligence Library for Analyzing and Identifying Emotional Nuances in Human Speech.
14. S. Sultana, M. S. Rahman, M. R. Selim, and M. Z. Iqbal. 2021. SUST Bangla Emotional Speech Corpus (SUBESCO): An Audio-Only Emotional Speech Corpus for Bangla.