Task 4: Polish Speech Emotion Recognition Challenge
Task repository
https://github.com/poleval/2025-speech-emotion
Introduction
Speech emotion recognition (SER) is a growing area of research with broad applications, driven by advances in automatic speech recognition (ASR) and large language models (LLMs). SER is uniquely challenging because it relies on both what is said and how it is said - capturing subtle emotional cues through audio and language.
Emotions are subjective and influenced by cultural, linguistic, and contextual factors, making it difficult to generalize across speakers and languages. These challenges are especially pronounced in low-resource languages like Polish.
To address this, we present the Polish Speech Emotion Recognition Challenge. The goal is to develop models that recognize emotions in Polish speech by effectively leveraging data from other languages. This tests a system’s ability to generalize across linguistic and acoustic domains, encouraging cross-lingual approaches to SER.
Task Definition
The goal of this task is to build a system that classifies emotions from speech. Given an audio recording, the system should predict one of six emotional states: anger, fear, happiness, sadness, surprise, or neutral.
Participants will receive a train set consisting of speech recordings in seven languages: Bengali, English, French, German, Italian, Russian, and Spanish. Additionally, a validation set consisting of Polish speech will be provided for evaluation only. The use of the validation set for training or data augmentation is strictly prohibited.
The final test set will consist of previously unseen audio recordings of Polish speech. It is forbidden to manually label the test samples.
Participants may use transfer learning or pretrained models, provided that these models have not been trained or fine-tuned on Polish data or on the nEMO dataset [1].
Participants are required to work strictly within the provided dataset. The use of external resources or any additional data, including publicly available datasets, is not allowed.
Dataset
The dataset used in this challenge is sourced from CAMEO [15], a comprehensive collection of multilingual emotional speech corpora. The dataset is available on Hugging Face: https://huggingface.co/datasets/amu-cai/CAMEO.
Dataset Metadata and Structure
The CAMEO dataset provides rich metadata for each audio sample, as shown in the following example:
```python
{
    'file_id': 'e80234c75eb3f827a0d85bb7737a107a425be1dd5d3cf5c59320b9981109b698.flac',
    'audio': {
        'path': None,
        'array': array([-3.05175781e-05,  3.05175781e-05, -9.15527344e-05, ...,
                        -1.49536133e-03, -1.49536133e-03, -8.85009766e-04]),
        'sampling_rate': 16000
    },
    'emotion': 'neutral',
    'transcription': 'Cinq pumas fiers et passionnés',
    'speaker_id': 'cafe_12',
    'gender': 'female',
    'age': '37',
    'dataset': 'CaFE',
    'language': 'French',
    'license': 'CC BY-NC-SA 4.0'
}
```
Data Fields
- `file_id` (`str`): A unique identifier for the audio sample.
- `audio` (`dict`): A dictionary containing:
  - `path` (`str` or `None`): Path to the audio file.
  - `array` (`np.ndarray`): Raw waveform of the audio.
  - `sampling_rate` (`int`): Sampling rate (16 kHz).
- `emotion` (`str`): The expressed emotional state.
- `transcription` (`str`): Orthographic transcription of the utterance.
- `speaker_id` (`str`): Unique identifier of the speaker.
- `gender` (`str`): Gender of the speaker.
- `age` (`str`): Age of the speaker.
- `dataset` (`str`): Name of the original dataset.
- `language` (`str`): Language spoken in the sample.
- `license` (`str`): License under which the original dataset is distributed.
Download
All audio recordings and their corresponding metadata for this challenge are accessed directly through the Hugging Face `datasets` library. Below is an example of how to load the dataset:

```python
from datasets import load_dataset

dataset = load_dataset("amu-cai/CAMEO", split="cafe")  # replace "cafe" with the desired split
```
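Once loaded, each sample is a dictionary following the structure described above. A minimal sketch of inspecting a sample (field names follow the metadata description; only the split name is chosen for illustration):

```python
from datasets import load_dataset

# Load one CAMEO split and inspect its first sample.
cafe = load_dataset("amu-cai/CAMEO", split="cafe")
sample = cafe[0]

waveform = sample["audio"]["array"]               # raw waveform (NumPy array)
sampling_rate = sample["audio"]["sampling_rate"]  # 16000
print(sample["file_id"], sample["emotion"], sample["language"])
print(f"duration: {len(waveform) / sampling_rate:.2f} s")
```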
Train Set
For this challenge, the following splits of the CAMEO dataset are used in the train set: `cafe`, `crema_d`, `emns`, `emozionalmente`, `enterface`, `jl_corpus`, `mesd`, `oreau`, `pavoque`, `ravdess`, `resd`, and `subesco`.
The audio recordings and metadata are not provided directly in this repository. Instead, they are accessed via the Hugging Face dataset. However, in the `in.tsv` file, each line specifies the split name (corresponding to the CAMEO splits) and the `file_id`, ensuring a precise mapping between the provided lists and the dataset hosted on Hugging Face. An example of the `in.tsv` file is shown below.
```
cafe e9d4b7b83bd1f6825dabca3fc51acd62099b3ab70bd86f702495917b9a6541a9.flac
emozionalmente 2e4c53a24becdbf4f1b266439287f2e0d25d0bf29f0248e98480d19da62a97b1.flac
resd ebcea26cf1ffffdb66eed7d7468b5ea9183ee41ac41941e59a1c51b15e4c41b6.flac
```
Important Note: This challenge does not use all samples from the original CAMEO dataset. Only the samples representing the relevant emotional states are included. These selected samples are listed in the `in.tsv` file and used for the `train` set.
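A minimal sketch of selecting the training samples from the Hugging Face splits using `in.tsv` (the `train/in.tsv` path is an assumption; adjust it to the repository layout):

```python
from collections import defaultdict

from datasets import concatenate_datasets, load_dataset

# Collect the file_ids listed in in.tsv, grouped by CAMEO split name.
wanted = defaultdict(set)
with open("train/in.tsv", encoding="utf-8") as f:  # assumed path to the provided in.tsv
    for line in f:
        split_name, file_id = line.split()
        wanted[split_name].add(file_id)

# Keep only the listed samples from each split and merge them into one train set.
subsets = []
for split_name, file_ids in wanted.items():
    subset = load_dataset("amu-cai/CAMEO", split=split_name)
    subsets.append(subset.filter(lambda ex, ids=file_ids: ex["file_id"] in ids))

train_set = concatenate_datasets(subsets)
```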
The train set consists of audio recordings from 12 different datasets: CaFE [2], CREMA-D [3], EMNS [4], Emozionalmente [5], eNTERFACE [6], JL-Corpus [7], MESD [8, 9], Oreau [10], PAVOQUE [11], RAVDESS [12], RESD [13], and SUBESCO [14].
The language and the distribution of samples per emotion in each subset are shown in the table below.
| Dataset | Language | # samples | anger | fear | happiness | neutral | sadness | surprise |
|---|---|---|---|---|---|---|---|---|
| CaFE | French | | | | | | | |
| CREMA-D | English | | | | | | | |
| EMNS | English | | | | | | | |
| Emozionalmente | Italian | | | | | | | |
| eNTERFACE | English | | | | | | | |
| JL-Corpus | English | | | | | | | |
| MESD | Spanish | | | | | | | |
| Oreau | French | | | | | | | |
| PAVOQUE | German | | | | | | | |
| RAVDESS | English | | | | | | | |
| RESD | Russian | | | | | | | |
| SUBESCO | Bengali | | | | | | | |
| Total | | | | | | | | |
Validation Set
The validation set consists solely of the `nemo` split of the CAMEO dataset, i.e., audio recordings in Polish from the nEMO dataset [1]. As with the training data, the audio recordings and metadata are accessible via Hugging Face. The distribution of samples per emotion is shown in the table below.
| | anger | fear | happiness | neutral | sadness | surprise |
|---|---|---|---|---|---|---|
| # samples | | | | | | |
Test Set
The audio recordings for the test set (`test-A` and `test-B`) are provided directly in this repository (as `test-A.tar.gz` and `test-B.tar.gz` archives) and are not part of the CAMEO dataset on Hugging Face. The corresponding metadata for these samples is included in a JSONL file (`metadata.jsonl`), mirroring the structure of the metadata for the `train` and `dev` sets, except for the `emotion` label, which participants are expected to predict.

An example of the `metadata.jsonl` file is shown below.
```json
{
    "file_id": "bb7ee27f3e269c14b1b33538667ea806f20d7ba182cf9efe5d86a7a99085614f.flac",
    "transcription": "Ochronię cię.",
    "speaker_id": "SB0",
    "gender": "male",
    "age": "24",
    "dataset": "test",
    "language": "Polish",
    "license": "CC BY-NC-SA 4.0"
}
```
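A minimal sketch of extracting a test archive and reading its metadata (the directory layout and file locations are assumptions; adjust paths to the actual repository structure):

```python
import json
import tarfile
from pathlib import Path

# Extract the test-A audio archive (assumed to unpack into a test-A/ directory).
with tarfile.open("test-A.tar.gz", "r:gz") as tar:
    tar.extractall("test-A")

# Read one metadata record per line from metadata.jsonl (assumed location).
records = []
with open(Path("test-A") / "metadata.jsonl", encoding="utf-8") as f:
    for line in f:
        records.append(json.loads(line))

print(len(records), "test samples")
print(records[0]["file_id"], records[0]["transcription"])
```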
Submission Format
The out.tsv
file must contain exactly one label per line, corresponding to the emotional state predicted for each audio sample. Each line in out.tsv
should match the audio sample listed in the same line of the in.tsv
file, and should contain only the label, with no additional information. Example of the out.tsv
file is shown below.
```
neutral
anger
happiness
```
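A minimal sketch of producing such a file, assuming an `in.tsv` with the same two-column format as the train set and a hypothetical `predict_emotion` function standing in for a participant's model:

```python
def predict_emotion(split_name: str, file_id: str) -> str:
    """Hypothetical placeholder for an actual SER model."""
    return "neutral"

# Write exactly one predicted label per line, in the same order as in.tsv.
with open("in.tsv", encoding="utf-8") as fin, open("out.tsv", "w", encoding="utf-8") as fout:
    for line in fin:
        split_name, file_id = line.split()
        fout.write(predict_emotion(split_name, file_id) + "\n")
```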
Evaluation
The primary evaluation metric for this challenge is macro-averaged F1 score (F1-macro). Additionally, overall accuracy will be reported as a secondary metric.
The F1-macro score is defined as:

$$
\text{F1}_{\text{macro}} = \frac{1}{N} \sum_{i=1}^{N} \text{F1}_i
$$

where $N$ is the number of classes, and $\text{F1}_i$ is the F1 score for class $i$, calculated as:

$$
\text{F1}_i = \frac{2 \cdot \text{Precision}_i \cdot \text{Recall}_i}{\text{Precision}_i + \text{Recall}_i}
$$

with:

$$
\text{Precision}_i = \frac{\text{TP}_i}{\text{TP}_i + \text{FP}_i}, \qquad \text{Recall}_i = \frac{\text{TP}_i}{\text{TP}_i + \text{FN}_i}
$$
Example
Given:
```python
y_true = [
    'happiness', 'happiness', 'neutral', 'surprise', 'neutral',
    'happiness', 'sadness', 'sadness', 'fear', 'sadness',
]
y_pred = [
    'surprise', 'sadness', 'happiness', 'surprise', 'anger',
    'happiness', 'anger', 'happiness', 'sadness', 'happiness',
]
```
The confusion matrix and intermediate metrics for each class are detailed in the table below.
| Emotion | TP | FP | FN | TN | Precision | Recall | F1 score |
|---|---|---|---|---|---|---|---|
| anger | 0 | 2 | 0 | 8 | 0.0000 | 0.0000 | 0.0000 |
| fear | 0 | 0 | 1 | 9 | 0.0000 | 0.0000 | 0.0000 |
| happiness | 1 | 3 | 2 | 4 | 0.2500 | 0.3333 | 0.2857 |
| neutral | 0 | 0 | 2 | 8 | 0.0000 | 0.0000 | 0.0000 |
| sadness | 0 | 2 | 3 | 5 | 0.0000 | 0.0000 | 0.0000 |
| surprise | 1 | 1 | 0 | 8 | 0.5000 | 1.0000 | 0.6667 |

Classes with undefined precision or recall (no predicted or no true samples) are assigned a score of 0.
The final F1-macro and accuracy:

$$
\text{F1}_{\text{macro}} = \frac{0.0000 + 0.0000 + 0.2857 + 0.0000 + 0.0000 + 0.6667}{6} \approx 0.1587, \qquad \text{Accuracy} = \frac{2}{10} = 0.2
$$
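These scores can be reproduced with scikit-learn (a sketch; the official evaluation may use different tooling):

```python
from sklearn.metrics import accuracy_score, f1_score

y_true = [
    'happiness', 'happiness', 'neutral', 'surprise', 'neutral',
    'happiness', 'sadness', 'sadness', 'fear', 'sadness',
]
y_pred = [
    'surprise', 'sadness', 'happiness', 'surprise', 'anger',
    'happiness', 'anger', 'happiness', 'sadness', 'happiness',
]
labels = ['anger', 'fear', 'happiness', 'neutral', 'sadness', 'surprise']

# Macro-averaged F1 over all six classes, and overall accuracy.
print(f1_score(y_true, y_pred, labels=labels, average='macro'))  # ~0.1587
print(accuracy_score(y_true, y_pred))                            # 0.2
```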
Post-processing
Due to the generative nature of LLMs, such models tend to produce descriptive responses instead of a single-word output corresponding to the predicted emotional state. To ensure that systems are not penalized for minor deviations, such as using an incorrect part of speech or responding with a complete sentence, we provide a script implementing the post-processing strategy introduced in [15].
To use the script, run the following command:

```
python process_outputs.py <path_to_input_file> <path_to_output_file>
```

The `process_outputs.py` script takes two positional arguments: the path to the input TSV file with the outputs from a model, and the path to the output TSV file, where the model's responses will be converted to the labels corresponding to the emotional states, according to the following strategy:
1. If a generated response is not an exact match, it is normalized and split into a list of individual words.
2. Then, for each target label, the Levenshtein similarity score between the label and each word in the generated response is computed.
3. Similarity scores below a predefined threshold of 0.57 are filtered out for each label.
4. The remaining scores are summed to yield an aggregated score for the given label.
5. The label with the highest aggregated similarity score is selected as the best match from all valid labels.
Additional details on the post-processing strategy, as well as an example, are available in [15].
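For illustration, a minimal sketch of this strategy (not the official `process_outputs.py`; the normalization, the Levenshtein implementation, and the fallback label are simplifying assumptions):

```python
def levenshtein_similarity(a: str, b: str) -> float:
    """Normalized Levenshtein similarity: 1 - distance / max(len(a), len(b))."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return 1.0 - prev[-1] / max(len(a), len(b), 1)


LABELS = ["anger", "fear", "happiness", "neutral", "sadness", "surprise"]
THRESHOLD = 0.57


def map_response_to_label(response: str) -> str:
    """Map a free-form model response to one of the target labels."""
    response = response.strip().lower()
    if response in LABELS:                                # exact match: nothing to do
        return response
    words = response.replace(".", " ").replace(",", " ").split()
    scores = {}
    for label in LABELS:
        sims = [levenshtein_similarity(label, w) for w in words]
        sims = [s for s in sims if s >= THRESHOLD]        # filter weak matches
        if sims:
            scores[label] = sum(sims)                     # aggregate per label
    if not scores:
        return "neutral"                                  # fallback label (an assumption)
    return max(scores, key=scores.get)


print(map_response_to_label("The speaker sounds angry."))  # -> "anger"
```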
Baseline
Qwen2-Audio-7B-Instruct
| | dev | test-A | test-B |
|---|---|---|---|
| F1-macro | 0.1829 | 0.1372 | |
| Accuracy | 0.2160 | 0.1883 | |
References
1. I. Christop. 2024. nEMO: Dataset of Emotional Speech in Polish.
2. P. Gournay, O. Lahaie, and R. Lefebvre. 2018. A Canadian French Emotional Speech Dataset.
3. H. Cao, D. G. Cooper, M. K. Keutmann, R. C. Gur, A. Nenkova, and R. Verma. 2014. CREMA-D: Crowd-Sourced Emotional Multimodal Actors Dataset.
4. K. A. Noriy, X. Yang, and J. J. Zhang. 2023. EMNS /Imz/ Corpus: An Emotive Single-Speaker Dataset for Narrative Storytelling in Games, Television and Graphic Novels.
5. F. Catania, J. W. Wilke, and F. Garzotto. 2020. Emozionalmente: A Crowdsourced Corpus of Simulated Emotional Speech in Italian.
6. O. Martin, I. Kotsia, B. Macq, and I. Pitas. 2006. The eNTERFACE'05 Audio-Visual Emotion Database.
7. J. James, L. Tian, and C. I. Watson. 2018. An Open Source Emotional Speech Corpus for Human Robot Interaction Applications.
8. M. M. Duville, L. M. Alonso-Valerdi, and D. I. Ibarra-Zarate. 2021. The Mexican Emotional Speech Database (MESD): Elaboration and Assessment Based on Machine Learning.
9. M. M. Duville, L. M. Alonso-Valerdi, and D. I. Ibarra-Zarate. 2021. Mexican Emotional Speech Database Based on Semantic, Frequency, Familiarity, Concreteness, and Cultural Shaping of Affective Prosody.
10. L. Kerkeni, C. Cleder, Y. Serrestou, and K. Raoof. 2020. French Emotional Speech Database - Oréau.
11. I. Steiner, M. Schröder, and A. Klepp. 2013. The PAVOQUE Corpus as a Resource for Analysis and Synthesis of Expressive Speech.
12. S. R. Livingstone and F. A. Russo. 2018. The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A Dynamic, Multimodal Set of Facial and Vocal Expressions in North American English.
13. A. Amentes, I. Lubenets, and N. Davidchuk. 2022. An Open Artificial Intelligence Library for the Analysis and Identification of Emotional Shades of Human Speech.
14. S. Sultana, M. S. Rahman, M. R. Selim, and M. Z. Iqbal. 2021. SUST Bangla Emotional Speech Corpus (SUBESCO): An Audio-Only Emotional Speech Corpus for Bangla.
15. I. Christop and M. Czajka. 2025. CAMEO: Collection of Multilingual Emotional Speech Corpora.