Source: Information Retrieval Technology (foreign-language conference)

Multi-scale TextTiling for Automatic Story Segmentation in Chinese Broadcast News


Abstract

This paper applies Chinese subword representations, namely character and syllable n-grams, to TextTiling-based automatic story segmentation of Chinese broadcast news. We show that, in lexical matching on error-prone Chinese speech recognition transcripts, Chinese subwords are robust against speech recognition errors, out-of-vocabulary (OOV) words, and the flexibility of word segmentation. We propose a multi-scale TextTiling approach that combines the specificity of words with the robustness of subwords in the lexical similarity measure used to identify story boundaries. Experiments on the TDT2 Mandarin corpus show that subword bigrams achieve the best performance among all scales, with relative F-measure improvements over words of 8.84% (character bigram) and 7.11% (syllable bigram). Multi-scale fusion of subword bigrams with words brings further improvement. In particular, integrating syllable bigrams with the syllable sequences of words yields an F-measure gain of 2.66% over syllable bigrams alone.
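The idea of fusing word-level and subword-level lexical similarity can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes sentences are already word-segmented lists of tokens, uses plain cosine similarity over raw counts, fuses the word and character-bigram scales with equal weights, and omits syllable n-grams (which would need a pronunciation lexicon). The function names `char_ngrams`, `gap_scores`, and `fused_scores` are hypothetical. Full TextTiling would also compute depth scores at similarity valleys; here the fused similarity at each gap is simply returned.

```python
import math
from collections import Counter


def char_ngrams(tokens, n):
    """Character n-grams of a sentence, ignoring word boundaries."""
    text = "".join(tokens)
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def cosine(a, b):
    """Cosine similarity between two Counter-based term vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def gap_scores(sentences, featurize, window=2):
    """Similarity between the windows on either side of each sentence gap."""
    scores = []
    for gap in range(1, len(sentences)):
        left = Counter(f for s in sentences[max(0, gap - window):gap]
                       for f in featurize(s))
        right = Counter(f for s in sentences[gap:gap + window]
                        for f in featurize(s))
        scores.append(cosine(left, right))
    return scores


def fused_scores(sentences, weights=(0.5, 0.5), window=2):
    """Weighted sum of word-scale and character-bigram-scale similarity.

    The equal weights are an illustrative assumption, not values
    taken from the paper.
    """
    word = gap_scores(sentences, lambda s: s, window)
    bigram = gap_scores(sentences, lambda s: char_ngrams(s, 2), window)
    return [weights[0] * w + weights[1] * b for w, b in zip(word, bigram)]


# Toy transcript: two weather sentences followed by two stock-market sentences.
sentences = [
    ["北京", "天气", "晴朗"],
    ["天气", "预报", "北京", "晴朗"],
    ["股市", "上涨", "股票"],
    ["股票", "市场", "上涨"],
]
scores = fused_scores(sentences, window=1)
boundary_gap = scores.index(min(scores))  # gap with the lowest fused similarity
```

The robustness argument shows up at the feature level: the segmentations `["北京市"]` and `["北京", "市"]` share no word tokens, so word-level matching fails, yet they produce identical character bigrams and therefore still match at the subword scale.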
