Task 1: Punctuation prediction from conversational language

Restore punctuation marks from the output of an ASR system.

Authors

Piotr Pęzik, Adam Wawrzyński, Michał Adamczyk, Sylwia Karasińska, Wojciech Janowski, Agnieszka Mikołajczyk, Filip Żarnecki, Patryk Neubauer and Anna Cichosz

Video introduction

Motivation

Speech transcripts generated by Automatic Speech Recognition (ASR) systems typically do not contain any punctuation or capitalization. In longer stretches of automatically recognized speech, the lack of punctuation affects the general clarity of the output text [1]. The primary purpose of punctuation restoration (PR), punctuation prediction (PP), and capitalization restoration (CR) as distinct natural language processing (NLP) tasks is to improve the legibility of ASR-generated text and possibly other types of texts lacking punctuation. For the purposes of this task, we define PR as the restoration of originally available punctuation in read speech transcripts (which was the goal of a separate task in the PolEval 2021 competition) [2] and PP as the prediction of possible punctuation in transcripts of spoken/conversational language. Aside from their intrinsic value, PR, PP, and CR may improve the performance of other NLP tasks such as Named Entity Recognition (NER), part-of-speech (POS) tagging, semantic parsing, and spoken dialog segmentation [3, 4].

One of the challenges in developing PP models for conversational language is the limited availability of consistently annotated datasets. The very nature of naturally occurring spoken language makes it difficult to identify exact phrase and sentence boundaries [5, 6], which means that dedicated guidelines are required to train and evaluate punctuation models.

The goal of the present task is to provide a solution for predicting punctuation in the test set collated for this purpose. The test set consists of time-aligned ASR dialogue transcriptions from three sources:

  1. CBIZ, a subset of DiaBiz, a corpus of phone-based customer support line dialogs (https://clarin-pl.eu/dspace/handle/11321/887)
  2. VC, a subset of transcribed video-communicator recordings
  3. SPOKES, a subset of the Spokes corpus [7]

Table 1 below summarizes the size of the three subsets in terms of dialogs, words and duration of recordings.

Subset   Corpus                 Files   Words    Audio [s]   Speakers   License
CBIZ     DiaBiz                 69      36 250   16 916      14         CC-BY-SA-NC-ND
VC       Video conversations    8       44 656   17 123      20         CC-BY-NC
Spokes   Casual conversations   13      42 730   20 583      19         CC-BY-NC

Table 1. Overall statistics of the corpus.

The full dataset has been split into three subsets as summarized in Table 2 below.

Set     Files   Words    Audio [s]   License
Train   69      98 095   44 030      CC-BY-SA-NC-ND
Dev     11      12 563   4 718       CC-BY-NC
Test    10      12 978   5 874       CC-BY-NC

Table 2. Training / development / test set statistics.

The punctuation annotation guidelines were developed in the CLARIN-BIZ project by Karasińska et al. [10].

Participants are encouraged to use both text-based and speech-derived features to identify punctuation symbols (e.g. in a multimodal framework [8]) or to predict casing along with punctuation [9]. Participants may also use the punctuation dataset available from PolEval 2021, Task 1: Punctuation restoration from read text [2].

Task description

The workflow of this task is illustrated in Fig. 1 below. Given raw ASR output, the task is to predict punctuation marks in transcripts of conversational speech.


Fig. 1 Overview of the punctuation prediction task.

The punctuation marks evaluated as part of the task are listed in Table 3 below. Blanks are marked as spaces. The distribution of explicit punctuation symbols in the training and development portions of the provided dataset is shown in Tables 3-6.

                   Symbol   Mean per dialog   Median   Max      Sum
fullstop           .        111.15            59       1 157    8 892
comma              ,        161.51            69       1 738    12 921
question_mark      ?        24.36             11       229      1 949
exclamation_mark   !        3.46              4        45       277
hyphen             -        0.64              25       50       51
ellipsis                    63.28             11       1 833    5 062
words                       1 383.23          569      16 528   110 658

Table 3. Punctuation for raw text (all subcorpora).

                   Symbol   Mean per dialog   Median   Max     Sum
fullstop           .        58.06             54       213     3 600
comma              ,        70.61             59       388     4 378
question_mark      ?        11.26             10       35      698
exclamation_mark   !        0.34              1        5       21
hyphen             -        0.02              1        1       1
ellipsis                    12.29             9        54      762
words                       528.74            483      2 180   32 782

Table 4. Punctuation for raw text (CBIZ).

                   Symbol   Mean per dialog   Median   Max     Sum
fullstop           .        411.86            384      1 157   2 883
comma              ,        737.86            577      1 738   5 165
question_mark      ?        85.29             41       229     597
exclamation_mark   !        10.43             5        43      73
hyphen             -        /                 /        /       /
ellipsis                    514.00            365      1 833   3 598
words                       5 704.14          4 398    9 469   39 929

Table 5. Punctuation for raw text (VC).

                   Symbol   Mean per dialog   Median   Max      Sum
fullstop           .        219.00            193      607      2 409
comma              ,        307.09            313      614      3 378
question_mark      ?        59.45             39       150      654
exclamation_mark   !        16.64             10       45       183
hyphen             -        4.55              50       50       50
ellipsis                    63.82             45       186      702
words                       3 449.73          1 966    16 528   37 947

Table 6. Punctuation for raw text (Spokes).

Data format

We provide two types of data: text and audio. Text data is provided in the TSV format. For audio data, we provide WAV recordings together with transcripts containing force-aligned timestamps.

WAVE files are provided as a separate download.

Text data

The datasets are encoded in the TSV format.

Field descriptions:

  • column 1: name of the audio file
  • column 2: unique segment id
  • column 3: segment text, where each word is separated by a single space

Each token of the segment text (column 3) has the following format:

  • word text:word start timestamp in ms-word end timestamp in ms
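
As an illustration, the sketch below reads such a TSV file and splits each segment into (word, start, end) tuples. The file name train.tsv is only a placeholder, and the tokens are assumed to be well formed exactly as described above.

```python
# Minimal sketch of reading the text data, assuming three tab-separated columns
# and the token format "word:start_ms-end_ms" described above.
def parse_segment(segment_text):
    """Split a segment into (word, start_ms, end_ms) tuples."""
    tokens = []
    for token in segment_text.split(" "):
        word, _, span = token.rpartition(":")          # tolerate ":" inside a word
        start_ms, _, end_ms = span.partition("-")
        tokens.append((word, int(start_ms), int(end_ms)))
    return tokens

with open("train.tsv", encoding="utf-8") as f:         # placeholder file name
    for line in f:
        audio_file, segment_id, segment_text = line.rstrip("\n").split("\t")
        print(audio_file, segment_id, parse_segment(segment_text)[:3])
```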

Evaluation procedure

Baseline results will be provided as part of the final evaluation.

Submission format

Results are to be submitted as a plain text file, where each line corresponds to a single segment and contains its text with the predicted punctuation marks inserted.
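
For illustration only, a minimal sketch of writing such a file from the test TSV is given below; predict_punctuation is a hypothetical stand-in for the participant's model and here simply returns the words unchanged.

```python
# Minimal sketch of producing the submission file: one punctuated segment per line,
# in the same order as the segments in the (assumed) test TSV.
def predict_punctuation(words):
    # Placeholder for an actual model inserting ".", ",", "?", "!", "-" or "...".
    return " ".join(words)

with open("test.tsv", encoding="utf-8") as f_in, \
        open("submission.txt", "w", encoding="utf-8") as f_out:
    for line in f_in:
        _audio, _segment_id, segment_text = line.rstrip("\n").split("\t")
        words = [token.rsplit(":", 1)[0] for token in segment_text.split(" ")]
        f_out.write(predict_punctuation(words) + "\n")
```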

Metrics

Final results are evaluated in terms of precision, recall, and F1 scores for predicting each punctuation mark separately. Submissions are compared with respect to the weighted average of F1 scores over all punctuation signs. The method of evaluation is similar to the one used in the previous PolEval competition, PolEval 2021: Task 1. Punctuation restoration from read text [2].

Per-document score for a punctuation sign p in document d, where P_{d,p} and R_{d,p} denote the precision and recall of predicting p in d:

F1_{d,p} = \frac{2 \cdot P_{d,p} \cdot R_{d,p}}{P_{d,p} + R_{d,p}}

Global score per punctuation sign p, averaged over the set of documents D:

F1_{p} = \frac{1}{|D|} \sum_{d \in D} F1_{d,p}

Final scoring metric, calculated as the weighted average of the global scores per punctuation sign, with a weight w_p assigned to each sign:

Score = \frac{\sum_{p} w_p \cdot F1_{p}}{\sum_{p} w_p}
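
A minimal sketch of this scoring scheme is shown below. It assumes that per-sign true positives, false positives, and false negatives have already been counted for each document, that the set of signs follows Table 3 (with the ellipsis written as "..."), and that the weight w_p of a sign is its number of occurrences in the reference annotation; the exact weighting used by the organizers may differ.

```python
# Sketch of the evaluation metric described above (assumed weighting: reference frequency).
from collections import defaultdict

PUNCTUATION = [".", ",", "?", "!", "-", "..."]   # signs from Table 3; "..." assumed for ellipsis

def f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def final_score(documents):
    """`documents` maps a document id to {sign: (tp, fp, fn)} counts."""
    per_sign_f1 = defaultdict(list)
    weights = defaultdict(int)
    for counts in documents.values():
        for sign in PUNCTUATION:
            tp, fp, fn = counts.get(sign, (0, 0, 0))
            per_sign_f1[sign].append(f1(tp, fp, fn))   # per-document score
            weights[sign] += tp + fn                   # reference occurrences of the sign
    global_f1 = {s: sum(v) / len(v) for s, v in per_sign_f1.items()}  # average over documents
    total = sum(weights.values())
    return sum(weights[s] * global_f1[s] for s in PUNCTUATION) / total if total else 0.0
```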

We would like to invite participants to a discussion about evaluation metrics, taking into account such factors as:

  • ASR and Forced-Alignment errors,
  • inconsistencies among annotators,
  • the impact of only slight displacement of punctuation,
  • assigning different weights to different types of errors.

Metadata

Tags: poleval-2022, asr

References

  1. Yi, J., Tao, J., Bai, Y., Tian, Z., & Fan, C. (2020). Adversarial transfer learning for punctuation restoration. arXiv preprint arXiv:2004.00248.
  2. Mikołajczyk, A., Wawrzyński, A., Pęzik, P., Adamczyk, M., Kaczmarek, A., & Janowski, W. (2021). PolEval 2021 Task 1: Punctuation Restoration from Read Text. Proceedings of the PolEval 2021 Workshop, 21.
  3. Nguyen, Thai Binh, et al. “Improving Vietnamese Named Entity Recognition from Speech Using Word Capitalization and Punctuation Recovery Models.” Proc. Interspeech 2020 (2020): 4263-4267.
  4. Hlubík, Pavel, et al. “Inserting Punctuation to ASR Output in a Real-Time Production Environment.” International Conference on Text, Speech, and Dialogue. Springer, Cham, 2020.
  5. Sirts, Kairit, and Kairit Peekman. “Evaluating Sentence Segmentation and Word Tokenization Systems on Estonian Web Texts.” Human Language Technologies–The Baltic Perspective: Proceedings of the Ninth International Conference Baltic HLT 2020. Vol. 328. IOS Press, 2020.
  6. Wang, Xueyujie. “Analysis of Sentence Boundary of the Host’s Spoken Language Based on Semantic Orientation Pointwise Mutual Information Algorithm.” 2020 12th International Conference on Measuring Technology and Mechatronics Automation (ICMTMA). IEEE, 2020.
  7. Pęzik, P. (2014, October). Spokes – a search and exploration service for conversational corpus data. In Selected Papers from the CLARIN 2014 Conference (pp. 99-109).
  8. Sunkara, Monica, et al. “Multimodal Semi-supervised Learning Framework for Punctuation Prediction in Conversational Speech.” arXiv preprint arXiv:2008.00702 (2020).
  9. Pappagari, R., Żelasko, P., Mikołajczyk, A., Pęzik, P., & Dehak, N. (2021). Joint prediction of truecasing and punctuation for conversational speech in low-resource scenarios. arXiv preprint arXiv:2109.06103.
  10. Karasińska, S., Cichosz, A., Adamczyk, M., & Pęzik, P. Evaluating punctuation prediction in conversational language. Forthcoming.