Journal: IEEE/ACM Transactions on Audio, Speech, and Language Processing

Spoken Content Retrieval—Beyond Cascading Speech Recognition with Text Retrieval

Abstract

Spoken content retrieval refers to directly indexing and retrieving spoken content based on the audio rather than text descriptions. This potentially eliminates the need to produce text descriptions of multimedia content for indexing and retrieval, and makes it possible to locate the exact time at which the desired information appears in the multimedia. Spoken content retrieval has been very successfully achieved with the basic approach of cascading automatic speech recognition (ASR) with text information retrieval: after the spoken content is transcribed into text or lattice format, a text retrieval engine searches over the ASR output to find the desired information. This framework works well when ASR accuracy is relatively high, but becomes less adequate in more challenging real-world scenarios, since retrieval performance depends heavily on ASR accuracy. This challenge has led to the emergence of another approach to spoken content retrieval: going beyond the basic framework of cascading ASR with text retrieval so that retrieval performance is less dependent on ASR accuracy. This article is intended to provide a thorough overview of the concepts, principles, approaches, and achievements of the major technical contributions along this line of investigation.
The overview covers five major directions: 1) Modified ASR for Retrieval Purposes: ASR is still cascaded with text retrieval, but is modified or optimized for spoken content retrieval; 2) Exploiting the Information Not Present in ASR Outputs: utilizing the information in speech signals that is inevitably lost when they are transcribed into phonemes and words; 3) Directly Matching at the Acoustic Level without ASR: for spoken queries, the signals can be matched directly at the acoustic level, rather than at the phoneme or word level, bypassing all ASR issues; 4) Semantic Retrieval of Spoken Content: retrieving spoken content that is semantically related to the query but does not necessarily include the query terms themselves; 5) Interactive Retrieval and Efficient Presentation of the Retrieved Objects: with efficient presentation of the retrieved objects, an interactive retrieval process incorporating user actions may produce better retrieval results and user experiences.
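
The basic cascade described in the abstract (transcribe first, then search the transcript) can be illustrated with a minimal sketch. It is not taken from the article: the transcripts, recording names, and the functions `build_index` and `search` are all hypothetical, and a real system would search over lattices with confidence scores rather than a single best word sequence.

```python
from collections import defaultdict

# Hypothetical ASR output: for each recording, a list of
# (recognized word, start time in seconds). This is made-up
# illustration data, not output from any real recognizer.
ASR_OUTPUT = {
    "lecture_01": [("spoken", 0.0), ("content", 0.4), ("retrieval", 0.9), ("overview", 1.6)],
    "lecture_02": [("text", 0.0), ("retrieval", 0.5), ("engine", 1.1)],
}


def build_index(asr_output):
    """Map each recognized word to the (recording, start time) pairs where it occurs."""
    index = defaultdict(list)
    for recording_id, words in asr_output.items():
        for word, start in words:
            index[word.lower()].append((recording_id, start))
    return index


def search(index, term):
    """Return every (recording, start time) at which the term was recognized."""
    return sorted(index.get(term.lower(), []))


index = build_index(ASR_OUTPUT)
hits = search(index, "retrieval")   # time-stamped hits in both recordings
missed = search(index, "lattice")   # never recognized, so nothing is returned
```

Because each index entry carries a start time, a hit points to a position inside the recording rather than to the recording as a whole. The sketch also shows the weakness the abstract names: a word the recognizer misses or misrecognizes never enters the index, so no text search can recover it.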
