Chinese Spoken Language Processing; Lecture Notes in Artificial Intelligence; 4274

Initial Experiments on Automatic Story Segmentation in Chinese Spoken Documents Using Lexical Cohesion of Extracted Named Entities

Abstract

Story segmentation plays a critical role in spoken document processing. Spoken documents often arrive as a continuous audio stream without explicit boundaries between stories or topics, so it is important to be able to segment these streams automatically into coherent units. This work is an initial attempt to use informative lexical terms (or key terms) in recognition transcripts of Chinese spoken documents for story segmentation, on the grounds that changes in the distribution of informative terms are generally associated with story changes and topic shifts. Our methods of informative lexical term extraction include the extraction of POS-tagged nouns, as well as a named entity identifier that extracts Chinese person names, transliterated person names, location names and organization names. We also adopted a lexical chaining approach that links up sentences that are lexically "coherent" with each other. This leads to the definition of a lexical chain score that is used to hypothesize story boundaries. We conducted experiments on the recognition transcripts of the TDT2 Voice of America Mandarin speech corpus. We compared several methods of story segmentation: the use of pauses, the use of lexical chains of all lexical entries in the recognition transcripts, the use of lexical chains of nouns tagged by a part-of-speech tagger, and the use of lexical chains of extracted named entities. Lexical chains of informative terms, namely POS-tagged nouns and named entities, were found to give comparable performance (F-measures of 0.71 and 0.73 respectively), which is superior to the use of all lexical entries (F-measure of 0.69).
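
The abstract does not spell out how the lexical chain score is computed, so the following is only a minimal sketch of one common formulation, not the authors' method. It assumes each sentence has already been reduced to a list of informative terms, that a chain links repetitions of the same term no more than max_gap sentences apart, and that the score at a sentence gap counts the chains ending just before it plus the chains starting just after it. The function names, max_gap and threshold are all hypothetical.

```python
from collections import defaultdict


def build_chains(sentences, max_gap=2):
    """Return one (start, end) sentence-index pair per lexical chain.

    Assumption: a chain links repetitions of the same term whose
    sentence distance is at most max_gap; a larger distance starts
    a new chain for that term.
    """
    positions = defaultdict(list)
    for idx, terms in enumerate(sentences):
        for term in set(terms):
            positions[term].append(idx)

    chains = []
    for idxs in positions.values():
        start = prev = idxs[0]
        for idx in idxs[1:]:
            if idx - prev > max_gap:
                chains.append((start, prev))
                start = idx
            prev = idx
        chains.append((start, prev))
    return chains


def chain_scores(sentences, max_gap=2):
    """Score gap g, which lies between sentence g and sentence g + 1.

    Assumed score: number of chains ending at sentence g plus
    number of chains starting at sentence g + 1.
    """
    n = len(sentences)
    scores = [0] * (n - 1)
    for start, end in build_chains(sentences, max_gap):
        if start > 0:
            scores[start - 1] += 1  # chain starts right after this gap
        if end < n - 1:
            scores[end] += 1        # chain ends right before this gap
    return scores


def hypothesize_boundaries(sentences, max_gap=2, threshold=3):
    """Return the gaps whose chain score reaches the (hypothetical) threshold."""
    scores = chain_scores(sentences, max_gap)
    return [g for g, s in enumerate(scores) if s >= threshold]


# Toy input: three sentences about one story, then two about another.
toy = [
    ["beijing", "olympics"],
    ["olympics", "beijing"],
    ["beijing", "athletes"],
    ["clinton", "washington"],
    ["washington", "clinton"],
]
print(chain_scores(toy))            # [0, 2, 4, 0]
print(hypothesize_boundaries(toy))  # [2]
```

On the toy input the highest score falls on the gap between the third and fourth sentences, where the terms of the first story stop recurring and those of the second begin. Restricting the input to nouns or named entities, as the abstract describes, would amount to filtering each term list before calling these functions.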
