Task 2: Morphosyntactic tagging of Middle, New and Modern Polish

Task definition

Morphosyntactic disambiguation is one of the classic NLP problems. For nearly ten years, the development and evaluation of morphosyntactic taggers for Polish have focused on the same dataset, namely NKJP1M. Our shared task provides an opportunity to build new systems, or tune existing ones, and test them in the somewhat different environment of more diverse and less standardised historical data. Although the data may seem unusual and atypical for everyday applications of NLP tools, the best-performing solutions may be deployed in a growing number of projects aimed at building historical corpora of various periods of Polish.

The data for this year’s task covers 400 years of the development of the Polish language. Text samples were drawn from three manually annotated corpora: KorBa, a corpus of 17th- and 18th-century Polish; a corpus of 19th-century Polish; and the 1-million-token subcorpus of the National Corpus of Polish (NKJP). The corpora represent three different periods in the development of Polish: Middle, New, and Modern.

All the texts were annotated with a historical tagset, which is similar to the tagset of Morfeusz SGJP, with some differences. For example, the number category has three values: singular, dual (Dwie żabie upragnione po polach biegały) and plural; there is a special flexeme adjb for historical “short” forms of adjectives and participles (rówien, pogrzebion, pięknę, swoję, …); and so on. The text is represented as a directed acyclic graph of interpretations, as returned by Morfeusz.

What we find interesting in this task is that the texts are not homogeneous, since the language has changed over time. In fact, 17th-century texts can be considered to represent a different (yet closely related) language than contemporary Polish. We provide the date of creation of each text and we explicitly ask the participants to take this information into account when building a tagger (a multi-tagger?).

The goal of the task is to disambiguate morphosyntactic interpretations and to guess the interpretation for unknown words — exactly as in Subtask 1A of PolEval 2017.

Data

Each provided file corresponds to a particular text from one of the corpora. The first line of the file contains a time marker for the text. It may be a single number denoting the year in which the text was written (e.g. #1651) or a range (e.g. #1651:1675, meaning that the text was written between 1651 and 1675). Sometimes only the lower limit of the range is known (e.g. #1651:, meaning the text was written in 1651 or later). A file may contain several text samples separated by empty lines.
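The three marker shapes above can be normalized into a (lower, upper) year range, with an open upper bound represented as None. A minimal sketch (the helper name and representation are ours, not part of the task kit):

```python
import re

def parse_time_marker(line: str):
    """Parse a time marker: '#1651', '#1651:1675' or '#1651:'.

    Returns (lower, upper); upper is None when only the lower
    limit of the range is known.
    """
    m = re.fullmatch(r"#(\d{4})(?::(\d{4})?)?", line.strip())
    if not m:
        raise ValueError(f"not a time marker: {line!r}")
    lower = int(m.group(1))
    if m.group(2):                 # explicit range, e.g. #1651:1675
        upper = int(m.group(2))
    elif ":" in line:              # open range, e.g. #1651:
        upper = None
    else:                          # single year, e.g. #1651
        upper = lower
    return lower, upper
```

Having the marker as a numeric range makes it easy to condition a tagger on the period of a text, e.g. by bucketing files into the Middle, New, and Modern periods.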

Each subsequent line contains one interpretation of a segment in a 7-column format:

  1. start position for the segment,
  2. end position for the segment,
  3. the segment,
  4. lemma for the corresponding lexeme,
  5. morphosyntactic tag,
  6. the string “nps” if there is no preceding space,
  7. the string “disamb” if this is the correct interpretation selected among the variants provided by the morphological analyzer, or “disamb_manual” if the correct interpretation was added by a human. Please note that in the case of manually added segmentation variants, all added segments are marked as “manual” even if some of them could be recognized by the analyzer in other contexts.
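The seven columns can be read into a simple record type. A sketch, assuming whitespace-separated columns as in the examples in this document (the class and field names are ours, not official):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Interpretation:
    """One row of the 7-column format."""
    start: int            # start position of the segment
    end: int              # end position of the segment
    segment: str
    lemma: str
    tag: str              # morphosyntactic tag
    nps: bool             # True if there is no preceding space
    disamb: Optional[str] # None, "disamb" or "disamb_manual"

def parse_line(line: str) -> Interpretation:
    # Segments contain no whitespace, so a plain split is enough;
    # columns 6 and 7 are optional flag strings.
    cols = line.split()
    flags = cols[5:]
    return Interpretation(
        start=int(cols[0]),
        end=int(cols[1]),
        segment=cols[2],
        lemma=cols[3],
        tag=cols[4],
        nps="nps" in flags,
        disamb=next((f for f in flags if f.startswith("disamb")), None),
    )
```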

We expect the accuracy of submitted taggers to differ depending on the age of a given text, so the score of a tagger is likely to depend on the mix of texts used for evaluation. For that reason we provide three data sets (train, devel, test) in two variants (disamb, plain). The training set is the largest and is intended for learning; in this set we provide as much data from each period as we have available.

The other two sets, devel and test, are smaller (about 40,000 segments each) and are guaranteed to have a similar distribution of texts in time: 50% of the segments are drawn from KorBa, 30% from the 19th-century corpus, and 20% from contemporary texts. The test set will be used for scoring the submitted taggers; the development set is meant to provide a preview of what to expect from the test set.

In the “disamb” variant of each data set, exactly one interpretation of each segment is marked as correct (in the 7th column). In the “plain” variant this column is stripped, together with all manual interpretations and segmentation variants. The train and devel data sets are provided in both variants; the test set will be provided only in the plain variant.

Example of stripped manual interpretation:
--- in disamb variant ---
36 37 inaczy inaczy adv disamb_manual
36 37 inaczy inaczyć fin:sg:ter:imperf

--- in plain variant ---
36 37 inaczy inaczyć fin:sg:ter:imperf

Example of stripped manual segmentation:
--- in disamb variant ---
271 272 więtszy więtszy adj:sg:nom:m:pos
271 272 więtszy więtszy adj:sg:voc:m:pos
271 273 więtszym więtszy adj:sg:inst:m:pos disamb_manual
271 273 więtszym więtszym ign
272 273 m być aglt:sg:pri:imperf:nwok nps

--- in plain variant ---
271 272 więtszy więtszy adj:sg:nom:m:pos
271 272 więtszy więtszy adj:sg:voc:m:pos
272 273 m być aglt:sg:pri:imperf:nwok nps
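As noted above, the interpretations of a sample form a directed acyclic graph: each row is an edge from its start position to its end position, so competing segmentations appear as alternative paths. A minimal sketch of building the adjacency map (the helper name is ours):

```python
from collections import defaultdict

def build_dag(interpretations):
    """Group (start, end, payload) triples into an adjacency map:
    each interpretation is an edge start -> end."""
    edges = defaultdict(list)
    for start, end, payload in interpretations:
        edges[start].append((end, payload))
    return dict(edges)

# The manual-segmentation example above yields two paths through
# positions 271..273: 271 -> 273 (więtszym) and 271 -> 272 -> 273.
rows = [
    (271, 272, "więtszy"),
    (271, 273, "więtszym"),
    (272, 273, "m"),
]
dag = build_dag(rows)
```

A tagger then has to pick one path through this graph and mark exactly one interpretation on each chosen edge as “disamb”.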

Besides the data provided by us, participants may use any auxiliary data that is available to the public under an open license (if in doubt, please ask).

Please note that the data was updated on 2020.01.24!

Please note that the data and the evaluation script were updated (again) on 2020.02.24! The changes should not influence the results.

Training data
  • train-disamb.tar.xz — gold standard training data in 7-column format
  • train-plain.tar.xz — training data without interpretations unknown to the morphological analyzer (6 columns)
Development data
  • devel-disamb.tar.xz — gold standard development data in 7-column format
  • devel-plain.tar.xz — development data without interpretations unknown to the morphological analyzer (6 columns)

Your solution should be able to produce “disamb” files when given “plain” files.

Test data

The test data will be provided later, in a form analogous to the “plain” files of the development data.

Evaluation script

Scoring

A contest entry should consist of an archive containing a directory of files named exactly as in the “test” archive above. Your solution should process the given files by adding a 7th column with the “disamb” marker where appropriate, and by adding complete interpretations where guessing is necessary.

The solutions will be scored for accuracy. An interpretation is considered correct if its segment and tag (columns 3 and 5) are the same as in the corresponding interpretation marked as gold standard; the choice of the lemma is not scored.
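A toy sketch of this metric (not the official evaluation script): collect the interpretation marked “disamb”/“disamb_manual” at each start position and compare only the (segment, tag) pair, ignoring the lemma. For simplicity this keys on start positions alone, which assumes gold and system segmentations align; the real script has to handle diverging segmentation graphs.

```python
def pick_disamb(lines):
    """Map start position -> (segment, tag) for lines whose last
    column is 'disamb' or 'disamb_manual'."""
    chosen = {}
    for line in lines:
        cols = line.split()
        if cols and cols[-1].startswith("disamb"):
            chosen[int(cols[0])] = (cols[2], cols[4])  # segment, tag
    return chosen

def accuracy(gold_lines, system_lines):
    gold = pick_disamb(gold_lines)
    system = pick_disamb(system_lines)
    correct = sum(1 for pos, st in gold.items() if system.get(pos) == st)
    return correct / len(gold) if gold else 0.0

gold = ["36 37 inaczy inaczy adv disamb_manual"]
system = ["36 37 inaczy inaczyć adv disamb"]  # different lemma, same segment/tag
```

Here the system line still counts as correct even though its lemma differs from the gold one, illustrating that the lemma column is not scored.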

References

Łukasz Kobyliński and Maciej Ogrodniczuk. Results of the PolEval 2017 competition: Part-of-speech tagging shared task. In Zygmunt Vetulani and Patrick Paroubek, editors, Proceedings of the 8th Language & Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics, pages 362–366, Poznań, Poland, 2017. Fundacja Uniwersytetu im. Adama Mickiewicza w Poznaniu.

Witold Kieraś, Dorota Komosińska, Emanuel Modrzejewski, and Marcin Woliński. Morphosyntactic annotation of historical texts. The making of the baroque corpus of Polish. In Kamil Ekštein and Václav Matoušek, editors, Text, Speech, and Dialogue: 20th International Conference, TSD 2017, Prague, Czech Republic, August 27–31, 2017, Proceedings, number 10415 in Lecture Notes in Computer Science, pages 308–316. Springer International Publishing, 2017.

Witold Kieraś and Marcin Woliński. Manually annotated corpus of Polish texts published between 1830 and 1918. In Nicoletta Calzolari et al., editors, Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), pages 3854–3859, Paris, France, 2018. European Language Resources Association (ELRA).

Adam Przepiórkowski, Mirosław Bańko, Rafał L. Górski, and Barbara Lewandowska-Tomaszczyk, editors. Narodowy Korpus Języka Polskiego. Wydawnictwo Naukowe PWN, Warsaw, 2012.