Results

Task 1: Post-editing and rescoring of ASR results

The number of submissions exceeded our expectations! We are amazed at the number of experiments people managed to perform, and even where the results were not the best, the conclusions will make a valuable contribution to this area of research. To this end, we would like to encourage all authors to prepare at least a short write-up of their chosen method(s).

The task of creating an interesting competition turned out to be more difficult than anticipated. If the ASR system used was too good, the post-editing problem would be too minor to be interesting; a very bad system, on the other hand, would make its output too difficult to analyze. As a compromise, the chosen system had a roughly average word error rate (among the systems we tested) of 27.6%. Participants who decided to use the lattice as their input could count on an oracle word error rate of 17.7%, which constitutes a floor for the error rate achievable with this ASR system. In other words, the system likely made many errors that were unrecoverable from the post-editing perspective, so an improvement of even a few absolute percentage points makes a significant difference.

The results were very close in a few cases. To make the assessment a bit more interesting, we calculated the error rate both against the reference (which was not known to the submitters) and against the single best ASR output (which was known to the submitters). The latter shows how much each submission changed the original (i.e. how "brave" the submission was, since even a small number of changes could yield a decent result).
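Both measures are instances of the standard word error rate: the Levenshtein distance between word sequences, divided by the length of the sequence being compared against. A minimal sketch of the computation (function name and sample sentences are illustrative, not taken from the evaluation scripts):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance divided by
    the number of words in the reference."""
    r, h = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution or match
    return dp[len(r)][len(h)] / len(r)

# WER % compares the submission against the hidden reference;
# Changes % applies the same formula against the known one-best ASR output.
print(wer("ala ma kota", "ala ma psa"))  # one substitution in three words
```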

Without further ado, the results were as follows:

Submission | Affiliation | WER % | Changes %
KRS + spaces | UJ, AGH | 25.9 | 3.6
KRS | UJ, AGH | 26.9 | 1.6
Polbert | https://skok.ai/ | 26.9 | 2.1
BiLSTM-CRF edit-operations tagger | Adam Mickiewicz University | 24.7 | 6.2
base-4g-rr | Samsung R&D Institute Poland | 27.7 | 2.0
t-REx_k10 | Uniwersytet Wrocławski | 24.9 | 14.2
t-REx_k5 | Uniwersytet Wrocławski | 25.0 | 14.2
t-REx_fbs | Uniwersytet Wrocławski | 24.31 | 17.2
PJA_CLARIN_1k | Polish-Japanese Academy of Information Technology | 33.5 | 9.1
PJA_CLARIN_10k | Polish-Japanese Academy of Information Technology | 32.0 | 9.6
PJA_CLARIN_20k | Polish-Japanese Academy of Information Technology | 31.8 | 9.9
PJA_CLARIN_40k | Polish-Japanese Academy of Information Technology | 31.8 | 10.3
PJA_CLARIN_50k | Polish-Japanese Academy of Information Technology | 31.8 | 10.2
CLARIN_SEJM_40k | Polish-Japanese Academy of Information Technology | 33.7 | 19.1
CLARIN_SEJM_50k | Polish-Japanese Academy of Information Technology | 32.5 | 17.7
MLM+bert_base_polish | | 73.9 | 2.1
tR-Ex_xk | Uniwersytet Wrocławski, Instytut Informatyki | 25.7 | 18.1
tR-Ex_fbs | Uniwersytet Wrocławski, Instytut Informatyki | 24.31 | 17.2
tR-Ex_fx | Uniwersytet Wrocławski, Instytut Informatyki | 25.0 | 23.3
tR-Ex_kxv2 | Uniwersytet Wrocławski, Instytut Informatyki | 25.5 | 17.1

The submission titled "MLM+bert_base_polish" only included 171 out of 462 files and cannot be directly compared with the others. On the files present, its result was 26.8%. If we restrict the winning submission to the same 171 files, its result is 23.4%, so this submission would have lost in that comparison as well.

The winning submission by the Wrocław University team, titled "flair-bigsmall", was submitted twice (both submissions were identical) and also lacked output for 2 files. The result above was calculated assuming these two files were completely incorrect. If these files were excluded instead, the result would be 24.0%.

It was a close call between the Wrocław University and Adam Mickiewicz University teams, but ultimately the former team won. We would like to thank all the teams once again for participating!


Task 2: Morphosyntactic tagging of Middle, New and Modern Polish

Submission | Affiliation | Accuracy | Acc. on known | Acc. on unknown | Acc. on manual
Alium-1.25 | | 0.8880 | 0.8985 | 0.4295 | 0.2427
Alium-1000 | | 0.8880 | 0.8985 | 0.4306 | 0.2427
KFTT train | UJ, AGH | 0.9564 | 0.9600 | 0.7991 | 0.6661
KFTT train+devel wo_morf | UJ, AGH | 0.9563 | 0.9595 | 0.8191 | 0.6730
KFTT train+devel | UJ, AGH | 0.9573 | 0.9607 | 0.8102 | 0.6781
Simple Baseline: COMBO | Allegro.pl, Melodice.org | 0.9284 | 0.9363 | 0.5838 | 0.5232
CMC Graph Heuristics | Wrocław University of Science and Technology | 0.9121 | 0.9214 | 0.5072 | 0.1670
Simple Baselines: XLM-R | Allegro.pl, Melodice.org | 0.9499 | 0.9562 | 0.6770 | 0.6850

Eight solutions for this task were submitted by four contestants. The results achieved are far better than we anticipated. Tagging historical Polish can be expected to be more difficult than tagging contemporary language: the tagset includes more features, some of them describing very rare phenomena; the proportion of tokens unknown to the morphological analyser is larger (2.25% vs. 1.26%); and the word order is less stable (with many discontinuous constructions). Yet the results compare favourably to those of PolEval 2017 Task 1(A) for contemporary language (http://2017.poleval.pl/index.php/results/). The best overall accuracy is 95.7%, compared to 94.6% in PolEval 2017. The most striking improvement lies in tagging tokens unknown to the morphological analyser: 81.9%, compared to 67% in PolEval 2017.
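The per-category accuracies above follow directly from splitting the token-level comparison by whether the morphological analyser knows the token. A minimal sketch of that breakdown (function name and sample tags are illustrative, not from the official evaluation script):

```python
def accuracy_breakdown(gold, predicted, known):
    """Overall tagging accuracy, plus separate accuracies for tokens
    known and unknown to the morphological analyser.

    gold, predicted: lists of morphosyntactic tags, one per token
    known: list of booleans, True if the analyser knows the token
    """
    assert len(gold) == len(predicted) == len(known)
    correct = [g == p for g, p in zip(gold, predicted)]
    overall = sum(correct) / len(correct)
    known_hits = [c for c, k in zip(correct, known) if k]
    unknown_hits = [c for c, k in zip(correct, known) if not k]
    acc_known = sum(known_hits) / len(known_hits) if known_hits else 0.0
    acc_unknown = sum(unknown_hits) / len(unknown_hits) if unknown_hits else 0.0
    return overall, acc_known, acc_unknown

# Example: one unknown token tagged incorrectly drags its category down
gold = ["subst:sg:nom:m", "adj:sg:nom:m", "fin:sg:ter", "subst:sg:acc:f"]
pred = ["subst:sg:nom:m", "adj:sg:nom:m", "adj:sg:nom:m", "subst:sg:acc:f"]
known = [True, True, False, True]
print(accuracy_breakdown(gold, pred, known))  # (0.75, 1.0, 0.0)
```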

These results require further study, which will hopefully lead to interesting discussions during the PolEval 2020 conference session. Generally, however, we can conclude that the presented systems not only improve the tagging of historical texts but also provide better taggers for contemporary Polish, which is a great achievement.


Task 3: Word sense disambiguation

Submission | Affiliation | Precision (KPWr) | Recall (KPWr) | Precision (Sherlock) | Recall (Sherlock)
Polbert for WSD (v2) | skok.ai | 0.599296 | 0.588727 | 0.592263 | 0.576850
Polbert for WSD | skok.ai | 0.564432 | 0.550860 | 0.564384 | 0.542966
PolevalWSDv1 | | 0.318547 | 0.231085 | 0.291732 | 0.200867

Task 4: IE and entity typing from long documents with complex layouts

Submission | Affiliation | F1 score
CLEX | Wrocław University of Science and Technology | 0.651±0.019
double_big | Poznan University of Technology; WIZIPISI | 0.606±0.017
300_xgb | Poznan University of Technology; WIZIPISI | 0.592±0.015
double_small | Poznan University of Technology; WIZIPISI | 0.588±0.018
300_RF | Poznan University of Technology; WIZIPISI | 0.587±0.015
middle_big | Poznan University of Technology; WIZIPISI | 0.585±0.016
100_RF | Poznan University of Technology; WIZIPISI | 0.584±0.016
Multilingual BERT + Random Forest | skok.ai | 0.440±0.014