ACM workshop on searching spontaneous conversational speech 2010

Story Segmentation for Speech Transcripts in Sparse Data Conditions


Abstract

Information retrieval systems determine relevance by comparing information needs with the content of potential retrieval units. Unlike most textual data, automatically generated speech transcripts cannot easily be divided into obvious retrieval units, because they lack explicit structural markers. This problem can be addressed by automatically detecting topically cohesive segments, or stories. However, when the collection consists of speech from less formal domains than broadcast news, most standard automatic boundary detection methods are potentially unsuitable because they rely on learned features. For conversational speech in particular, the lack of adequate training data can present a significant issue. In this paper, four methods for automatic segmentation of speech transcriptions are compared. They were selected for their independence from collection-specific knowledge and were implemented without the use of training data. Two of the four methods are based on existing algorithms; the other two are novel approaches, one based on a dynamic segmentation algorithm (QDSA) that incorporates information about the query, the other based on WordNet. Experiments were done on a task similar to the TREC SDR unknown-boundaries condition. For the best performing system, QDSA, the retrieval scores for a tfidf-type ranking function were equivalent to those of a reference segmentation, and they improved with document length normalization using the BM25/Okapi method. For the task of automatically segmenting speech transcripts for use in information retrieval, we conclude that a training-poor processing paradigm, which can be crucial for handling surprise data, is feasible.
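The abstract does not give the exact ranking formula or its parameters, so the following is only a minimal sketch of how BM25/Okapi-style document length normalization affects the score of automatically produced segments. The function name `bm25_scores`, the defaults `k1=1.2` and `b=0.75`, and the non-negative idf variant are assumptions chosen for illustration, not values taken from the paper.

```python
import math
from collections import Counter


def bm25_scores(query, segments, k1=1.2, b=0.75):
    """Score each segment (a list of tokens) against a list of query tokens.

    Illustrative only: k1, b and the idf variant are assumed defaults,
    not the settings used in the paper.
    """
    n = len(segments)
    if n == 0:
        return []
    avg_len = sum(len(seg) for seg in segments) / n
    # Document frequency: number of segments that contain each term.
    df = Counter(term for seg in segments for term in set(seg))
    scores = []
    for seg in segments:
        tf = Counter(seg)
        # Length normalization: b=0 disables it, b=1 applies it fully.
        norm = 1 - b + b * (len(seg) / avg_len if avg_len else 0.0)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * norm)
        scores.append(score)
    return scores


# Two hypothetical segments that mention the query term equally often.
short_seg = ["budget", "vote"]
long_seg = ["budget", "vote", "weather", "weather", "sports", "traffic"]
normalized = bm25_scores(["budget"], [short_seg, long_seg])
unnormalized = bm25_scores(["budget"], [short_seg, long_seg], b=0.0)
```

With `b=0.0` the two segments score the same because each contains the query term once; with length normalization switched on, the shorter segment ranks higher. This is why the segment lengths produced by a segmentation method interact with the ranking function.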
