Task 3: Post-correction of OCR results


Although Optical Character Recognition (OCR) methods have a long history of development, their performance is still limited, especially for low-quality source material, historical texts and non-standard document layouts. The extent to which they can be improved through image analysis alone is also limited, as some parts of the source documents may simply be unrecognizable in low-quality training data. In another scenario, the quantity of training data may be insufficient to train a representative model for the specific language variant used in a set of documents. For example, in a comparison of the recognition results of two popular OCR engines published by the IMPACT project (https://www.digitisation.eu/fileadmin/Tool_Training_Materials/Abbyy/PSNC_Tesseract-FineReader-report.pdf), the recognition rate for historical documents (published before 1850, in Polish) was 80% at the character level and ca. 60% at the word level. Such accuracy levels are far from satisfactory in most applications, as the errors propagate to further stages of text processing.

On the other hand, there has been considerable progress in Natural Language Processing (NLP) methods, especially in language modeling approaches based on deep neural networks. Models such as BERT and its variants have been shown to increase accuracy in many NLP tasks, such as named entity recognition, fake news detection, or question answering. The motivation for this task is thus to evaluate how much OCR results can be improved by applying modern NLP approaches to language error correction. In a similar task (https://sites.google.com/view/icdar2019-postcorrectionocr), the improvement of OCR results reported for the best overall performer submitted to the challenge ranged from 6% for Spanish to 24% for German (https://drive.google.com/file/d/15mxNO-M9PiXBnffi7MOa8wUw33nj1xBp/view). For some specific languages and submitted methods the results were even better, such as 44% for Finnish and 26% for French.
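To illustrate why a word-level recognition rate is typically lower than the character-level rate for the same text, the sketch below computes both from a reference transcription and an OCR output. It is only a minimal illustration based on Levenshtein (edit) distance; the function names and the sample strings are hypothetical, and the exact metric definitions used in the cited reports may differ.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def accuracy(reference, hypothesis):
    """1 - (edit distance / reference length), clipped at 0 (an assumed definition)."""
    if not reference:
        return 1.0 if not hypothesis else 0.0
    return max(0.0, 1.0 - edit_distance(reference, hypothesis) / len(reference))


def char_accuracy(reference, hypothesis):
    # Compare the two texts character by character.
    return accuracy(reference, hypothesis)


def word_accuracy(reference, hypothesis):
    # Compare whitespace-separated tokens; one wrong character spoils a whole word.
    return accuracy(reference.split(), hypothesis.split())


# Hypothetical example: two single-character OCR errors in a four-word line.
reference = "the quick brown fox"
ocr_output = "tbe quick hrown fox"
print(f"character accuracy: {char_accuracy(reference, ocr_output):.3f}")  # 0.895
print(f"word accuracy:      {word_accuracy(reference, ocr_output):.3f}")  # 0.500
```

In this example two misrecognized characters out of 19 leave the character-level accuracy near 90%, while half of the words are wrong, which is why downstream word-based processing suffers disproportionately.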