Journal: Computing and Informatics

THAI MULTI-DOCUMENT SUMMARIZATION: UNIT SEGMENTATION, UNIT-GRAPH FORMULATION, AND UNIT SELECTION


Abstract

Summarizing multiple Thai documents is challenging because the Thai language lacks explicit word, phrase, and sentence boundaries. This paper defines the Thai Elementary Discourse Unit (TEDU) and presents a three-stage summarization process. To implement this process, we propose unit segmentation using TEDUs and their derivatives; unit-graph formation using iterative unit weighting and cosine similarity; and unit selection using highest-weight priority, redundancy removal, and post-selection weight recalculation. To examine the performance of the proposed methods, experiments are conducted on fifty sets of Thai news articles with their manually constructed reference summaries. Measured by three common evaluation metrics, ROUGE-1, ROUGE-2, and ROUGE-SU4, the results show that (1) TEDU-based summarization outperforms paragraph-based summarization, (2) our iterative weighting is superior to traditional TF-IDF, (3) highest-weight priority without centroid preference and unit redundancy consideration helps improve summary quality, and (4) post-selection weight recalculation tends to raise summarization performance under certain circumstances.
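The abstract names the building blocks of the graph and selection stages but not their exact formulas, so the following is only a rough sketch under stated assumptions, not the authors' method. It assumes units are already segmented into token lists (TEDU segmentation itself is not shown), uses raw term counts for cosine similarity, substitutes a PageRank-style update for the paper's iterative unit weighting, and uses a hypothetical similarity threshold `max_sim` for redundancy removal. Post-selection weight recalculation is omitted. All function and parameter names are illustrative.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_graph(units):
    """Build a symmetric similarity matrix (zero diagonal) over units.

    `units` is a list of token lists; each unit is one graph node.
    """
    vecs = [Counter(u) for u in units]
    n = len(vecs)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            sim[i][j] = sim[j][i] = cosine(vecs[i], vecs[j])
    return sim


def iterate_weights(sim, damping=0.85, iters=50):
    """PageRank-style iterative weighting (an assumed stand-in).

    Each unit repeatedly receives weight from its neighbours in
    proportion to edge similarity.
    """
    n = len(sim)
    if n == 0:
        return []
    out = [sum(row) for row in sim]  # total edge weight per node
    w = [1.0 / n] * n
    for _ in range(iters):
        w = [
            (1 - damping) / n
            + damping * sum(
                sim[j][i] / out[j] * w[j] for j in range(n) if out[j] > 0
            )
            for i in range(n)
        ]
    return w


def select_units(sim, weights, k, max_sim=0.5):
    """Pick up to k units by highest weight, skipping redundant ones.

    A candidate is treated as redundant if its similarity to any
    already chosen unit exceeds `max_sim` (a hypothetical threshold).
    """
    chosen = []
    for i in sorted(range(len(weights)), key=lambda i: -weights[i]):
        if len(chosen) >= k:
            break
        if all(sim[i][j] <= max_sim for j in chosen):
            chosen.append(i)
    return chosen


# Toy usage with English tokens standing in for segmented Thai units.
units = [
    ["flood", "hits", "bangkok"],
    ["flood", "hits", "bangkok", "today"],
    ["election", "results", "announced"],
    ["flood", "election"],
]
sim = build_graph(units)
weights = iterate_weights(sim)
chosen = select_units(sim, weights, k=2)
```

In this toy input the first two units are near-duplicates, so the redundancy check keeps at most one of them in the selection regardless of which unit ends up with the highest weight.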
