Story Segmentation for Speech Transcripts in Sparse Data Conditions

Abstract

Information retrieval systems determine relevance by comparing information needs with the content of potential retrieval units. Unlike most textual data, automatically generated speech transcripts lack explicit structural markers and therefore cannot easily be divided into obvious retrieval units. This problem can be addressed by automatically detecting topically cohesive segments, or stories. However, when the collection consists of speech from domains less formal than broadcast news, most standard automatic boundary detection methods are potentially unsuitable because they rely on learned features. For conversational speech in particular, the lack of adequate training data can be a significant issue. This paper compares four methods for automatic segmentation of speech transcripts. They were selected for their independence from collection-specific knowledge and were implemented without the use of training data. Two of the four methods are based on existing algorithms; the other two are novel approaches, one based on a dynamic segmentation algorithm (QDSA) that incorporates information about the query, and one based on WordNet. Experiments were done on a task similar to the TREC SDR unknown-boundaries condition. For the best performing system, QDSA, the retrieval scores for a tf-idf-type ranking function were equivalent to those obtained with a reference segmentation, and improved through document length normalization using the BM25/Okapi method. For the task of automatically segmenting speech transcripts for use in information retrieval, we conclude that a training-poor processing paradigm, which can be crucial for handling surprise data, is feasible.
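The abstract's point about document length normalization can be illustrated with a minimal sketch of the standard Okapi BM25 scoring formula applied to transcript segments. This is not the paper's implementation: the function name `bm25_scores`, the toy segments, and the parameter values `k1=1.2` and `b=0.75` are assumptions chosen for illustration only.

```python
import math
from collections import Counter

def bm25_scores(query, segments, k1=1.2, b=0.75):
    """Score each tokenized segment against a tokenized query with Okapi BM25.

    k1 and b are illustrative defaults, not values taken from the paper.
    """
    n = len(segments)
    avg_len = sum(len(s) for s in segments) / n
    # Document frequency: number of segments that contain each term.
    df = Counter(term for s in segments for term in set(s))
    scores = []
    for seg in segments:
        tf = Counter(seg)
        # Length normalization: with b > 0, segments longer than average
        # get a larger denominator and therefore a lower score.
        norm = k1 * (1 - b + b * len(seg) / avg_len)
        score = 0.0
        for term in set(query):
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores.append(score)
    return scores

# Hypothetical segments, as an automatic segmenter might produce them.
segments = [
    "the weather today is sunny and warm".split(),
    "weather weather forecast rain rain rain tomorrow and the day after and later in the week".split(),
    "the election results were announced".split(),
]
scores = bm25_scores("weather forecast".split(), segments)
scores_no_norm = bm25_scores("weather forecast".split(), segments, b=0.0)
```

Setting `b=0.0` switches length normalization off, so comparing `scores` with `scores_no_norm` shows how much a longer-than-average segment is penalized. This matters for automatic segmentation, where segment lengths can vary widely.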

Bibliographic details

Source
ACM workshop on searching spontaneous conversational speech 2010 | 2010 | pp. 33-38 | 6 pages
Conference location: Firenze (IT)
Author
Laurens van der Werff
Author affiliation
University of Twente, HMI group, P.O. Box 217, 7500AE Enschede, The Netherlands
Original format: PDF
Language: English
Chinese Library Classification: Computing technology, computer technology
Keywords
algorithms; experimentation; performance
Date indexed: 2022-08-26 14:19:04
