Linear segmentation of ASR transcripts and text by topic

The recent explosion of available audio-visual media poses a new challenge for information retrieval research. Automatic speech recognition (ASR) systems translate spoken content into the text domain, but the resulting transcripts possess no logical structure, and they still need to be searched and indexed. One way to structure them at a high level of abstraction is to find topic boundaries. In the course of this work, two unsupervised topic segmentation methods were evaluated on real-world data. The first, TSF, models topic shifts as fluctuations in the similarity function of the transcript. The second, LCSeg, treats topic changes as the places where the fewest lexical chains overlap. Only LCSeg came close to the results reported for a similar real-world corpus; other reported results could not be outperformed. Topic analysis based on models of repeated word usage renders topic changes more ambiguous than expected, and this issue affects segmentation quality more than the state-of-the-art ASR word error rate does. The conclusion is that topic segmentation algorithms should be developed with real-world data to avoid potential biases toward artificial data. Unlike the evaluated approaches based on word usage analysis, methods that operate on local contexts can be expected to perform better by emulating semantic dependencies.
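The general idea of treating topic shifts as dips in a similarity function can be illustrated with a minimal sketch. It is not the TSF or LCSeg implementation evaluated in the thesis: it is a simplified, TextTiling-style stand-in, and the stopword list, window size, threshold, function names, and sample sentences are all assumptions chosen for illustration.

```python
import math
from collections import Counter

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"the", "a", "in", "of", "and"}

# Made-up sample "transcript": three sentences on football, three on politics.
SENTENCES = [
    "the goal keeper saved the penalty",
    "the striker scored a late goal",
    "the penalty decided the match",
    "parliament passed a new budget",
    "ministers debated the budget in parliament",
    "the budget vote split parliament",
]


def bag_of_words(sentence):
    """Lowercase, split on whitespace, and count the non-stopword tokens."""
    return Counter(w for w in sentence.lower().split() if w not in STOPWORDS)


def cosine(a, b):
    """Cosine similarity between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def gap_similarities(sentences, window=2):
    """Similarity of the `window` sentences before and after each sentence gap."""
    bags = [bag_of_words(s) for s in sentences]
    sims = []
    for gap in range(1, len(bags)):
        left = sum(bags[max(0, gap - window):gap], Counter())
        right = sum(bags[gap:gap + window], Counter())
        sims.append(cosine(left, right))
    return sims


def boundaries(sims, threshold=0.1):
    """Gaps that are local minima of the similarity curve and lie below `threshold`."""
    found = []
    for i, s in enumerate(sims):
        prev_s = sims[i - 1] if i > 0 else float("inf")
        next_s = sims[i + 1] if i + 1 < len(sims) else float("inf")
        if s <= prev_s and s <= next_s and s < threshold:
            found.append(i + 1)  # the boundary lies just before sentence i + 1
    return found


if __name__ == "__main__":
    print(boundaries(gap_similarities(SENTENCES)))  # prints [3]
```

On this clean toy input the only detected boundary lies before the fourth sentence, where the vocabulary changes completely. Real ASR transcripts are much noisier, which is consistent with the abstract's observation that word-repetition models leave topic changes ambiguous.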

Document Type: Master's Thesis
Author: Peter Muryshkin
Number of pages: X, 83, VIII
Referee: Wolfgang Heiden, Peter Becker, Sebastian Tschöpel
Publisher: Fraunhofer Publica
Granting Institution: Hochschule Bonn-Rhein-Sieg, Fachbereich Informatik
Contributing Corporation: Fraunhofer-Institut Intelligente Analyse- und Informationssysteme
Publication year: 2011
Dewey Decimal Classification (DDC): 0 Computer science, information, general works / 00 Computer science, knowledge, systems / 004 Data processing; computer science
Theses, student research papers: Hochschule Bonn-Rhein-Sieg / Fachbereich Informatik
Entry in this database: 2018/11/10