Conference: 10th Western Pacific Acoustics Conference

A Study of Indexing Units for Japanese Spoken Document Retrieval

Abstract

This paper addresses spoken document retrieval (SDR) from Japanese lectures. A lecture retrieval test collection (an ad-hoc SDR task) was recently designed in Japan; it consists of 2,702 audio lectures from the Corpus of Spontaneous Japanese and 39 retrieval queries. For an ad-hoc task, appropriate indexing is important. Index terms are produced by automatic speech recognition (ASR) and therefore inherently contain ASR errors, so indexing terms that are robust to such errors need to be studied. In addition, Japanese text has no spaces between words, which makes word units ambiguous, so the choice of indexing unit also matters. Against this background, we investigate indexing units for Japanese SDR: morphemes, character N-grams, and combinations of the two. Morpheme-based indexing cannot handle misrecognition of part of a word, which motivates indexing units based on character N-grams. Although SDR performance improved for some queries, we did not achieve an overall improvement, and combining character N-grams with morpheme units did not work well. We confirmed the importance of introducing stop-word criteria in character N-gram-based indexing.
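
To illustrate why character N-grams can tolerate a misrecognized part of a word, here is a minimal Python sketch. It is not the authors' implementation. The function names, the example strings, the simulated ASR error, and the hand-picked stop N-gram set are all assumptions made for illustration; the abstract does not specify the stop-word criteria that were actually used.

```python
from collections import Counter


def char_ngrams(text, n=2):
    """Return overlapping character n-grams of text, ignoring whitespace."""
    chars = "".join(text.split())
    return [chars[i:i + n] for i in range(len(chars) - n + 1)]


def index_terms(text, n=2, stop_ngrams=frozenset()):
    """Count character n-grams, skipping any listed in stop_ngrams.

    stop_ngrams is a hypothetical, hand-picked stop list; it stands in for
    whatever stop-word criteria a real system would apply.
    """
    return Counter(g for g in char_ngrams(text, n) if g not in stop_ngrams)


# Hypothetical reference transcript and ASR output with one wrong character.
reference = "音声文書検索"
recognized = "音声分書検索"

ref_terms = index_terms(reference)
hyp_terms = index_terms(recognized)

# Bigrams that survive the recognition error and can still match a query.
shared = set(ref_terms) & set(hyp_terms)
print(shared)

# Filtering a frequent, uninformative bigram with the hypothetical stop list.
filtered = index_terms("検索です", stop_ngrams=frozenset({"です"}))
print(filtered)
```

In this toy case a single wrong character destroys only the two bigrams that overlap it, while the remaining three still match. An index keyed on a single term for the whole string would miss the match entirely.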