Task 4: eXtreme eXtraction.pl - information extraction and entity typing from long documents with complex layouts

Task definition

The eXtreme eXtraction challenge concerns information extraction and inference, two areas of Natural Language Processing. Extracting information from real-life, long documents requires handling complex page layouts and integrating the entities found across multiple pages and across text sections, tables, plots, forms, etc. To encourage progress on deeper and more complex information extraction, we present a novel dataset in which systems must find the most important information about various types of entities in formal documents. These entities include not only the classes used by standard named entity recognition (NER) systems (e.g. person, location, or organisation) but also the roles the entities play in the whole document (e.g. CEO, issue date).

Data description

The data consists of two folders (train and validate). Each of these folders contains a `.csv` file with ground truth values for each report. For each report, the `.pdf` (raw input), `.txt` (text input) and `.hocr` (text input with positional info) files are placed in the `reports/{report_id}` folder.
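The layout described above can be walked with a few lines of Python. This is a minimal sketch, not an official loader: it assumes that `reports/` sits directly inside the train or validate folder, and it matches files by extension only, because the file names inside `reports/{report_id}` are not stated here. The function name `find_report_files` is hypothetical.

```python
from pathlib import Path


def find_report_files(split_dir, report_id):
    """Return the files found for one report, grouped by extension.

    Assumption: the report folder is <split_dir>/reports/<report_id>.
    Files are matched by extension because their names are not specified.
    """
    report_dir = Path(split_dir) / "reports" / str(report_id)
    return {
        ext: sorted(report_dir.glob(f"*{ext}"))
        for ext in (".pdf", ".txt", ".hocr")
    }
```

For example, `find_report_files("train", 118734)[".txt"]` would list the plain-text files for report 118734, or an empty list if the folder does not exist.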

The dataset contains 2216 unique records.

SA;2012-08-30;2012-01-01;2012-06-30;39-100;Ropczyce;ul. Przemysłowa;1;[('2012-08-30', 'Józef Siwiec', 'Prezes Zarządu'), ('2012-08-30', 'Marian Darłak', 'Wiceprezes Zarządu'), ('2012-08-30', 'Robert Duszkiewicz', 'Wiceprezes Zarządu')]
118734;PC GUARD S.A.;2009-08-31;2009-01-01;2009-06-30;60-467;Poznań;Jasielska 16;16;[('2009-08-31', 'Dariusz Grześkowiak', 'Prezes Zarządu'), ('2009-08-31', 'Mariusz Bławat', 'Członek Zarządu')]

Data description

The `ground_truth.csv` files included in the directories contain the following columns:

  • *id* - unique identifier of a specific financial report
  • *company* - name of the company
  • *drawing_date* - date which specifies when the financial report was submitted
  • *period_from* - start of the obligation period
  • *period_to* - end of the obligation period
  • *postal_code* - postal code of the company
  • *city* - the city where the company is registered
  • *street* - the name of the street where the company is registered
  • *street_no* - the street number at which the company is registered
  • *people* - members/chairmen of the company management. Each cell contains a list of tuples, where each tuple has the form (<date of signature>, <name and surname>, <position>), e.g. ('2019-12-16', 'Jan Kowalski', 'Prezes Zarządu').

For each column, no more than 15% of documents have an incorrect ground truth value.
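The column list and the tuple format suggest a simple way to read the ground truth. The sketch below is one possible approach, under assumptions: fields are separated by semicolons (as in the sample records), the *people* cell is a Python-style list literal that `ast.literal_eval` can read, and whether the file starts with a header row is unknown, so that is left as a flag. The names `COLUMNS` and `parse_ground_truth` are hypothetical.

```python
import ast
import csv
import io

# Column order as listed in this description.
COLUMNS = ["id", "company", "drawing_date", "period_from", "period_to",
           "postal_code", "city", "street", "street_no", "people"]


def parse_ground_truth(text, has_header=False):
    """Parse semicolon-separated ground truth text into a list of dicts."""
    rows = list(csv.reader(io.StringIO(text), delimiter=";"))
    if has_header:
        rows = rows[1:]  # assumption: the first row holds column names
    records = []
    for row in rows:
        if not row:
            continue  # skip blank lines
        record = dict(zip(COLUMNS, row))
        people = record.get("people", "")
        # The people cell is a list of (date, name, position) tuples.
        record["people"] = ast.literal_eval(people) if people else []
        records.append(record)
    return records


# One of the sample records from this description.
sample = (
    "118734;PC GUARD S.A.;2009-08-31;2009-01-01;2009-06-30;60-467;Poznań;"
    "Jasielska 16;16;[('2009-08-31', 'Dariusz Grześkowiak', 'Prezes Zarządu'), "
    "('2009-08-31', 'Mariusz Bławat', 'Członek Zarządu')]"
)
records = parse_ground_truth(sample)
```

`ast.literal_eval` is used instead of `eval` because it only accepts literals, so a malformed or malicious cell cannot run code.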


Prec, Recall, F1, Likelihood
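
Precision, recall and F1 have standard definitions, illustrated below. How the task actually matches predictions against the ground truth is not specified here, so this sketch assumes, purely for illustration, that both are given as sets of (report id, column, value) triples compared by exact match. The function name `precision_recall_f1` is hypothetical, and the likelihood measure is not covered.

```python
def precision_recall_f1(predicted, gold):
    """Compute precision, recall and F1 for two sets of items.

    Assumption: an item counts as correct only on an exact match.
    """
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data, invented for this example only.
gold = {(1, "city", "Poznań"), (1, "street_no", "16"),
        (2, "city", "Ropczyce"), (2, "street_no", "1")}
predicted = {(1, "city", "Poznań"), (1, "street_no", "16"),
             (2, "city", "Warszawa")}
precision, recall, f1 = precision_recall_f1(predicted, gold)
```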

Coverage in the whole dataset (% of documents):

company-present 88.16
street-present 88.99
drawing_date-present 93.40
postal_code-present 94.70
city-present 98.85
street_no-present 98.92
period_from-present 99.57
period_to-present 99.75
people-present 100.00
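
A figure of this kind could be computed along the following lines. This is only a sketch under a stated assumption: the exact meaning of "present" is not defined here, so the code treats a value as present when the ground truth string occurs verbatim in the report text. The function name `coverage` and the toy inputs are hypothetical.

```python
def coverage(records, texts, column):
    """Percentage of records whose value for `column` occurs in the report text.

    Assumption: "present" means a verbatim substring match.
    `texts` maps a report id to the plain text of that report.
    """
    if not records:
        return 0.0
    hits = sum(
        1 for record in records
        if record[column] and record[column] in texts.get(record["id"], "")
    )
    return round(100.0 * hits / len(records), 2)


# Toy data, invented for this example only.
toy_records = [
    {"id": "1", "city": "Poznań"},
    {"id": "2", "city": "Ropczyce"},
]
toy_texts = {
    "1": "PC GUARD S.A., Poznań, ul. Jasielska 16",
    "2": "Raport półroczny za okres 2012-01-01 do 2012-06-30",
}
city_coverage = coverage(toy_records, toy_texts, "city")
```

A stricter or looser matching rule (case folding, date normalisation, inflected Polish forms) would change the numbers, so any comparison with the figures above depends on that choice.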