Task 1: Punctuation restoration from read text

Speech transcripts generated by Automatic Speech Recognition (ASR) systems typically contain no punctuation or capitalization. In longer stretches of automatically recognized speech, the lack of punctuation reduces the clarity of the output text. The primary purpose of punctuation restoration (PR) and capitalization restoration (CR) as distinct natural language processing (NLP) tasks is to improve the legibility of ASR-generated text, and possibly of other types of text that lack punctuation. Aside from their intrinsic value, PR and CR may improve the performance of other NLP tasks such as Named Entity Recognition (NER), part-of-speech (POS) tagging, semantic parsing and spoken dialog segmentation.
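To make the input and output of the task concrete, the following minimal sketch simulates ASR-style text by stripping punctuation and casing from a written sentence; a PR/CR system would aim to reverse this transformation. The function name `strip_punctuation_and_case` and the sample sentence are illustrative assumptions, not part of the task data or tooling, and the sketch handles only ASCII punctuation.

```python
import string


def strip_punctuation_and_case(text: str) -> str:
    """Simulate unpunctuated, lowercased ASR-style output (hypothetical helper).

    Simplification: only ASCII punctuation is removed, so typographic
    quotes and similar characters would be left in place.
    """
    table = str.maketrans("", "", string.punctuation)
    # Remove punctuation, lowercase, and normalize whitespace.
    return " ".join(text.translate(table).lower().split())


# Illustrative sentence, not taken from the task data.
reference = "Hello, world! This is a test."
asr_like = strip_punctuation_and_case(reference)
print(asr_like)  # hello world this is a test
```

In this framing, `reference` plays the role of the target that a restoration system should recover from `asr_like`.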

 


Task 2: Evaluation of translation quality assessment metrics

The task is to investigate metrics for the automatic evaluation of machine translation output or other similar data types. We have prepared translations from English to Polish along with reference translations made by a human, a Polish native speaker. In this task we are looking for automatic evaluation metrics. The task will be evaluated at the level of segments, which are often similar to sentences. The result of the task will be the calculated correlation of the submitted scores with human evaluations performed manually.
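The segment-level evaluation described above can be sketched as a correlation between one metric score and one human score per segment. The text does not state which correlation coefficient is used, so the sketch below assumes Pearson's r; the function name `pearson` and all score values are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists.

    Assumption: Pearson's r is used here for illustration only; the task
    description does not name the coefficient.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-segment scores: one submitted metric score and one
# human judgement for each translated segment.
metric_scores = [0.71, 0.35, 0.90, 0.52]
human_scores = [4.0, 2.0, 5.0, 3.0]
print(pearson(metric_scores, human_scores))
```

A value close to 1 would indicate that the submitted metric ranks and scales segments much as the human evaluators did.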


Task 3: Post-correction of OCR results

Although Optical Character Recognition (OCR) methods have a long history of development, their performance is still limited, especially in the case of low-quality source material, historical texts and non-standard document layouts. The extent to which such methods can be improved by relying on image analysis is also limited, as some parts of source documents might simply be unrecognizable in low-quality training data. In another scenario, the quantity of the training data might be insufficient to train a representative model for the specific language variant used in a set of documents.