International Conference on Brain-Inspired Cognitive Systems

Self-validated Story Segmentation of Chinese Broadcast News



Abstract

Automatic story segmentation is an important prerequisite for semantic-level applications. The normalized cuts (NCuts) method has recently shown great promise for segmenting English spoken lectures. However, its assumption that the exact number of stories per file is known in advance significantly limits its ability to handle large collections of transcripts. Moreover, it remains unclear how to apply the method to Chinese in the presence of speech recognition errors. Addressing these two problems, we propose a self-validated NCuts (SNCuts) algorithm for segmenting Chinese broadcast news from the inaccurate lexical cues produced by a Chinese large-vocabulary continuous speech recognizer (LVCSR). Owing to the characteristics of the Chinese language, we present a subword-level graph embedding for the erroneous LVCSR transcripts. We regularize the NCuts criterion with a general exponential prior on the number of stories, in keeping with the principle of Occam's razor. Given only the maximum story number as a general parameter, the algorithm automatically produces reasonable segmentations for large collections of news transcripts, determining the story number for each file on its own, at a complexity comparable to alternative non-self-validated methods. Extensive experiments on a benchmark corpus show that: (i) the proposed SNCuts algorithm can efficiently produce segmentation quality comparable to, or even better than, other state-of-the-art methods that take the true story number as an input parameter; and (ii) the subword-level embedding consistently helps to recover lexical cohesion in erroneous Chinese transcripts, improving both segmentation accuracy and robustness to LVCSR errors.
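The abstract does not spell out the algorithm, so the following is only an illustrative sketch of the two ideas it names: subword-level (character-bigram) features in place of error-prone word tokens, and self-validation of the story count via a per-segment penalty standing in for the exponential prior. It replaces the paper's spectral NCuts with a simple dynamic program over contiguous segments that maximizes normalized within-segment association (the complement of the NCut objective); all identifiers and the penalty form are our assumptions, not the paper's.

```python
# Illustrative sketch only -- NOT the paper's SNCuts implementation.
from collections import Counter
from math import sqrt

def char_bigrams(sent):
    # Subword-level features: character bigrams are more robust than words
    # to Chinese LVCSR errors (wrong word boundaries, substituted characters).
    return Counter(sent[i:i + 2] for i in range(len(sent) - 1))

def cosine(c1, c2):
    num = sum(v * c2[k] for k, v in c1.items())
    den = sqrt(sum(v * v for v in c1.values())) * sqrt(sum(v * v for v in c2.values()))
    return num / den if den else 0.0

def segment(sents, k_max, lam=0.3):
    """Split sents into contiguous stories, choosing the story count
    (up to k_max) automatically -- the 'self-validated' part."""
    n = len(sents)
    feats = [char_bigrams(s) for s in sents]
    W = [[cosine(feats[i], feats[j]) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in W]

    def cohesion(a, b):
        # Normalized association of block [a, b): within-weight / total-weight.
        # Maximizing summed cohesion is the complement of minimizing NCut.
        assoc = sum(deg[i] for i in range(a, b))
        within = sum(W[i][j] for i in range(a, b) for j in range(a, b))
        return within / assoc if assoc else 0.0

    k_max = min(k_max, n)
    NEG = float("-inf")
    # dp[k][b] = best total cohesion splitting the first b sentences into k blocks
    dp = [[NEG] * (n + 1) for _ in range(k_max + 1)]
    back = [[0] * (n + 1) for _ in range(k_max + 1)]
    dp[0][0] = 0.0
    for k in range(1, k_max + 1):
        for b in range(k, n + 1):
            for a in range(k - 1, b):
                score = dp[k - 1][a] + cohesion(a, b)
                if score > dp[k][b]:
                    dp[k][b], back[k][b] = score, a
    # Self-validation: the penalty lam * k plays the role of the exponential
    # prior on story numbers, discouraging needlessly many segments.
    best_k = max(range(1, k_max + 1), key=lambda k: dp[k][n] - lam * k)
    bounds, b = [], n
    for k in range(best_k, 0, -1):
        a = back[k][b]
        bounds.append((a, b))
        b = a
    return bounds[::-1]
```

On a toy transcript whose first two sentences share stock-market bigrams and whose last two share weather bigrams, `segment(sents, 4)` picks two stories with the boundary between them, even though it is only told the maximum count. The cubic-time DP is far cruder than the paper's method; it exists only to make the cohesion-versus-prior trade-off concrete.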
