Task 3: Entity linking

Task definition

The task covers the identification of mentions of entities from a knowledge base (KB) in Polish texts. The reference KB for this task is Wikidata (WD), an offshoot of Wikipedia: a knowledge base that unifies the structured data available in the various language editions of Wikipedia.

For instance, the following text:


Zaginieni 11-latkowie w środę rano wyszli z domów do szkoły w Nowym Targu, gdzie przebywali do godziny 12:00. Jak informuje "Tygodnik Podhalański", 11-letni Ivan już się odnalazł, ale los Mariusza Gajdy wciąż jest nieznany. Chłopcy od chwili zaginięcia przebywali razem między innymi w Zakopanem. Mieli się rozstać w czwartek rano.
Source: gazeta.pl

has 3 entity mentions:

  • “Nowym Targu” (the town Nowy Targ),
  • “Tygodnik Podhalański” (the weekly Tygodnik Podhalański),
  • “Zakopanem” (the town Zakopane).

Even though there are more mentions that have corresponding entries in WD (such as “środa”, “dom”, “12:00”, etc.), we restrict the set of entities to a closed group of WD types, roughly corresponding to the types commonly found in Named Entity Recognition (with the important exclusion of times and dates). It should also be noted that names such as “Ivan” and “Mariusz Gajda” should not be annotated, since they lack corresponding entries in WD.

The task is similar to Named Entity Recognition (NER), with the important difference that in EL the set of entities is closed. To some extent EL is also similar to Word Sense Disambiguation (WSD), since mentions are ambiguous between competing entities.

In this task we have decided to ignore nested mentions of entities, so a name such as “Zespół Szkół Łączności im. Obrońców Poczty Polskiej w Gdańsku, w Krakowie”, which has its own entry in Wikidata, should be treated as an atomic linguistic unit, even though it contains several entities with their own Wikidata entries (such as Poczta Polska w Gdańsku, Gdańsk and Kraków). The algorithm is also required to identify all mentions of an entity in a given document, even if they are exactly the same as previous mentions.

The results will be evaluated using the F1 score. Exact and partial matches will be scored in the same way as in the NER task from 2018.
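As a rough illustration, the sketch below computes a plain mention-level F1 score in Python; the exact weighting of exact versus partial matches follows the 2018 NER task and is not restated here, so treat this as a simplified outline rather than the official scorer.

def f1_score(tp: float, fp: float, fn: float) -> float:
    # tp, fp, fn are mention-level counts; in the official evaluation a
    # partial match would contribute a reduced (fractional) weight to tp.
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)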

Training data

The most common training data used in EL is Wikipedia itself. Even though it was not designed as a reference corpus for that task, the structure of internal links serves as a good source of training and testing data, since the number of links inside Wikipedia is counted in millions. An important difference between Wikipedia links and EL against Wikidata is that the titles of Wikipedia articles evolve, while the WD identifiers remain constant. We will soon provide a portion of Wikipedia text with a reference mapping of the titles to WD entities. Such a mapping is fairly easy to obtain, since most WD entries include a link to at least one Wikipedia edition.
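As an illustration of how such a title-to-identifier mapping might be built, the sketch below assumes a standard Wikidata JSON dump (the file name and the exact dump layout are assumptions, not part of the provided data), in which each item lists its Wikipedia articles under “sitelinks” and the Polish edition under the “plwiki” key.

import bz2
import json

def build_title_to_qid(dump_path: str) -> dict:
    # Map Polish Wikipedia article titles to Wikidata IDs (e.g. Q62898),
    # reading a Wikidata JSON dump with one item per line.
    title_to_qid = {}
    with bz2.open(dump_path, "rt", encoding="utf-8") as dump:
        for line in dump:
            line = line.strip().rstrip(",")
            if line in ("[", "]", ""):
                continue
            item = json.loads(line)
            plwiki = item.get("sitelinks", {}).get("plwiki")
            if plwiki is not None:
                title_to_qid[plwiki["title"]] = item["id"]
    return title_to_qid

Renamed or redirected article titles would still have to be resolved separately, which is exactly why the stable WD identifiers are used as the reference.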

The second important difference is the fact that, according to the Wikipedia editing rules, a link should be provided only for the first mention of a salient concept in an article. This differs from the requirements of this task, in which all mentions have to be identified.
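One possible (purely heuristic) way to bridge this gap when preparing additional training examples is to propagate a link from the first mention to later occurrences of the same surface form within the article. The function below is only a sketch of that idea for single-token mentions; a span-matching variant would be needed for multi-token names, and this step is not part of the official data preparation.

def propagate_links(tokens):
    # tokens: list of (token, link_title) pairs for one article, where
    # link_title is None for unlinked tokens.
    first_link = {}
    for token, title in tokens:
        if title is not None and token not in first_link:
            first_link[token] = title
    return [(token, title if title is not None else first_link.get(token))
            for token, title in tokens]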

The training data consists of the tokenised and sentence-split Wikipedia text (DOWNLOAD) as well as the Wikidata items (DOWNLOAD). The sentences are separated by an empty line. Each line contains the following data (a short reading sketch follows the field descriptions):

[doc_id, token, preceding_space, link_title, entity_id]

  • doc_id – an internal Wikipedia identifier of the article; it may be used to disambiguate entities collectively within a single document (by exploiting the internal coherence of entity mentions),
  • token – the value of the token,
  • preceding_space – 1 indicates that the token was preceded by a blank character (a space in most cases), 0 otherwise,
  • link_title – the title of the Wikipedia article that is the target of the internal link containing the given token; some of the links point to articles that do not exist in Wikipedia; _ (underscore) is used when the token is not part of a link,
  • entity_id – the ID of the entity in Wikidata; this is the value that has to be determined by the algorithm; _ (underscore) is used when the ID could not be established.
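A minimal sketch of how this format could be read is given below; it assumes that the columns are tab-separated (note that link_title may itself contain spaces) and that sentences are delimited by empty lines, as described above.

def read_sentences(path):
    # Yield sentences as lists of records with the five fields described
    # above; "_" is normalised to None for link_title and entity_id.
    sentence = []
    with open(path, encoding="utf-8") as data:
        for line in data:
            line = line.rstrip("\n")
            if not line:
                if sentence:
                    yield sentence
                    sentence = []
                continue
            doc_id, token, preceding_space, link_title, entity_id = line.split("\t")
            sentence.append({
                "doc_id": doc_id,
                "token": token,
                "preceding_space": preceding_space == "1",
                "link_title": None if link_title == "_" else link_title,
                "entity_id": None if entity_id == "_" else entity_id,
            })
    if sentence:
        yield sentence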

Sample data
2 Nazwa 1 _ _
2 języka 1 _ _
2 pochodzi 1 _ _
2 od 1 _ _
2 pierwszych 1 _ _
2 liter 1 _ _
2 nazwisk 1 _ _
2 jego 1 _ _
2 autorów 1 _ _
2 Alfreda 1 Alfred V. Aho Q62898
2 V 1 Alfred V. Aho Q62898
2 . 0 Alfred V. Aho Q62898
2 Aho 1 Alfred V. Aho Q62898
2 , 0 _ _
2 Petera 1 Peter Weinberger _
2 Weinbergera 1 Peter Weinberger _
2 i 1 _ _
2 Briana 1 Brian Kernighan Q92608
2 Kernighana 1 Brian Kernighan Q92608
2 i 1 _ _
2 czasami 1 _ _
2 jest 1 _ _
2 zapisywana 1 _ _
2 małymi 1 _ _
2 literami 1 _ _
2 oraz 1 _ _
2 odczytywana 1 _ _
2 jako 1 _ _
2 jedno 1 _ _
2 słowo 1 _ _
2 awk 1 _ _
2 . 0 _ _

[Alfred V. Aho] and [Brian Kernighan] have their corresponding Wikidata IDs, since it was possible to determine them using the Wikipedia and Wikidata datasets. Peter Weinberger does not have an ID: even though there is a Wikidata entry about him, there is no corresponding article in the Polish Wikipedia, so the link could not be established automatically. In the test set only the items that have corresponding Polish Wikipedia articles will have to be determined. Moreover, the algorithm will only have to determine the target of the link, not its span, so for the previous example the test data will look as follows (the fourth column is superfluous but kept for compatibility with the training data):

2 Nazwa 1 _ _
2 języka 1 _ _
2 pochodzi 1 _ _
2 od 1 _ _
2 pierwszych 1 _ _
2 liter 1 _ _
2 nazwisk 1 _ _
2 jego 1 _ _
2 autorów 1 _ _
2 Alfreda 1 _ e1
2 V 1 _ e1
2 . 0 _ e1
2 Aho 1 _ e1
2 , 0 _ _
2 Petera 1 _ _
2 Weinbergera 1 _ _
2 i 1 _ _
2 Briana 1 _ e2
2 Kernighana 1 _ e2
2 i 1 _ _
2 czasami 1 _ _
2 jest 1 _ _
2 zapisywana 1 _ _
2 małymi 1 _ _
2 literami 1 _ _
2 oraz 1 _ _
2 odczytywana 1 _ _
2 jako 1 _ _
2 jedno 1 _ _
2 słowo 1 _ _
2 awk 1 _ _
2 . 0 _ _

Thus the algorithm will be informed about 2 mentions, one spanning [Alfreda V. Aho] and another spanning [Briana Kernighana]. It should be noted that in the test data mentions linking to the same entity will have separate mention IDs, unless they form a continuous span of tokens.
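To make the mention handling concrete, the sketch below groups the tokens of a test sentence by their mention ID (e1, e2, …) to recover the spans that have to be linked; it reuses the record layout from read_sentences above, with the entity_id column now holding mention IDs, under the same tab-separation assumption as before.

def mention_spans(sentence):
    # sentence: list of records as produced by read_sentences, where the
    # entity_id field now holds a mention ID such as "e1" or None.
    spans = []
    for record in sentence:
        mention_id = record["entity_id"]
        if mention_id is None:
            continue
        if spans and spans[-1][0] == mention_id:
            spans[-1][1].append(record["token"])
        else:
            spans.append((mention_id, [record["token"]]))
    return spans

For the sample above this yields [("e1", ["Alfreda", "V", ".", "Aho"]), ("e2", ["Briana", "Kernighana"])], and the algorithm's task is then to map e1 and e2 to Wikidata IDs.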

Test data

The test corpus will not include any texts from Wikipedia. The input and output format will be based on CONLL-U. We will provide tokenisation, lemmatisation and morphosyntactic tags for the input data. They will be obtained using one of the leading taggers for Polish (we will announce in advance which tagger is used).
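For reference, a CONLL-U line carries ten tab-separated columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC). The fragment below (tabs shown as spaces) is only an illustration of that general layout; the concrete tagset, the contents of the unused columns and the way the entity annotations will be attached are assumptions, since they have not been announced yet.

1   Nazwa    nazwa    NOUN   subst:sg:nom:f    Case=Nom|Gender=Fem|Number=Sing    _   _   _   _
2   języka   język    NOUN   subst:sg:gen:m3   Case=Gen|Gender=Masc|Number=Sing   _   _   _   _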
