Task 1: Punctuation prediction from conversational language
Speech transcripts generated by Automatic Speech Recognition (ASR) systems typically do not contain any punctuation or capitalization. In longer stretches of automatically recognized speech, lack of punctuation affects the general clarity of the output text. The primary purpose of punctuation restoration (PR), punctuation prediction (PP), and capitalization restoration (CR) as a distinct natural language processing (NLP) task is to improve the legibility of ASR-generated text and possibly other types of texts without punctuation. For the purposes of this task, we define PR as restoration of originally available punctuation from read speech transcripts (which was the goal of a separate task in the PolEval 2021 competition)  and PP as prediction of possible punctuation in transcripts of spoken/ conversational language. Aside from their intrinsic value, PR, PP, and CR may improve the performance of other NLP aspects such as Named Entity Recognition (NER), part-of-speech (POS), and semantic parsing or spoken dialog segmentation.
Task 2: Abbreviation disambiguation
Abbreviations are often overlooked in many NLP pipelines. However, they are still an important point to tackle, especially in such applications as machine translation, named entity recognition, or text-to-speech systems.
Task 3: Passage retrieval
Passage Retrieval is a crucial part of modern open-domain question-answering systems that rely on precise and efficient retrieval components to find passages containing correct answers.
Traditionally, lexical methods like TF-IDF or BM25 were commonly used to power the retrieval systems. They are fast, interpretable, and don’t require any training (and therefore a training set). However, they can only return a document if it contains a keyword present in a query. Moreover, their text understanding is limited because they ignore the word order.