Identifying Introductions in Podcast Episodes from Automatically Generated Transcripts

Jing, Elise; Schneck, Kristiana; Egan, Dennis; Waterman, Scott A.


arXiv:2110.07096v1 (cs)

[Submitted on 14 Oct 2021 (this version), latest version 9 Dec 2021 (v2)]

Abstract: As the volume of long-form spoken-word content such as podcasts grows rapidly, many platforms want to present short, meaningful, and logically coherent segments extracted from the full content. Users can consume such segments to sample content before diving in, and platforms can use them to promote and recommend content. However, little published work focuses on the segmentation of spoken-word content, where the errors (noise) in transcripts generated by automatic speech recognition (ASR) services pose many challenges. Here we build a novel dataset of complete transcriptions of over 400 podcast episodes, in which we label the position of the introduction in each episode. These introductions contain information about the episodes' topics, hosts, and guests, providing a valuable summary of the episode content as created by the episodes' authors. We further augment our dataset with word substitutions to increase the amount of available training data. We train three Transformer models, based on pre-trained BERT and using different augmentation strategies, which achieve significantly better performance than a static embedding model, showing that it is possible to capture generalized, larger-scale structural information from noisy, loosely organized speech data. This is further demonstrated through an analysis of the models' inner architecture. Our methods and dataset can be used to facilitate future work on the structure-based segmentation of spoken-word content.

Comments:	Accepted at PodRecs 2021, a RecSys workshop
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2110.07096 [cs.CL]
	(or arXiv:2110.07096v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2110.07096

Submission history

From: Elise Jing
[v1] Thu, 14 Oct 2021 00:34:51 UTC (1,830 KB)
[v2] Thu, 9 Dec 2021 23:43:43 UTC (1,826 KB)
