
Automatic Arabic text summarization using clustering and keyphrase extraction


Abstract

As the number of electronic documents grows rapidly, faster techniques are needed to assess their relevance. A summary is a concise representation of the underlying text. Forming an ideal summary requires a full understanding of the document, but such understanding is difficult or impossible for computers. The most common technique in automated text summarization is therefore to select important sentences from the original text and present them as the summary. This paper proposes a hybrid clustering method (partitioning and hierarchical) to group many Arabic documents into several clusters. A keyphrase extraction module is then applied to extract important keyphrases from each cluster, which helps identify the most important sentences and find similar sentences using several similarity algorithms. The module extracts one sentence from each group of similar sentences and ignores the others (i.e., sentences whose similarity exceeds a predefined threshold). The model is designed for both single- and multi-document Arabic text summarization. The Recall-Oriented Understudy for Gisting Evaluation (ROUGE) metric was used for evaluation. The Essex Arabic Summaries Corpus, which contains many topic-based articles with multiple human summaries, was used as the summarization dataset. The model achieved an accuracy of 80% for single-document and 62% for multi-document summarization.
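The redundancy step described in the abstract (keep one sentence from each group of similar sentences, drop those above a similarity threshold) can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not name its similarity algorithms or its threshold, so the bag-of-words cosine similarity, the default threshold of 0.7, and the function names `bag_of_words`, `cosine_similarity`, and `select_non_redundant` are all assumptions made for illustration. The input is assumed to be a list of sentences already ranked by importance, and the example sentences are English stand-ins for readability.

```python
import math
import re
from collections import Counter


def bag_of_words(sentence):
    """Word counts for a sentence (the regex word class also matches Arabic letters)."""
    return Counter(re.findall(r"\w+", sentence.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors (assumed similarity measure)."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_non_redundant(ranked_sentences, threshold=0.7):
    """Greedily keep a sentence only if it is not too similar to any kept one.

    `ranked_sentences` is assumed to be ordered from most to least important;
    `threshold` is a hypothetical value, not taken from the paper.
    """
    kept = []
    kept_vectors = []
    for sentence in ranked_sentences:
        vec = bag_of_words(sentence)
        if all(cosine_similarity(vec, kv) <= threshold for kv in kept_vectors):
            kept.append(sentence)
            kept_vectors.append(vec)
    return kept


if __name__ == "__main__":
    ranked = [
        "the cat sat on the mat",
        "the cat sat on the mat today",
        "stock prices fell sharply",
    ]
    # The second sentence is a near-duplicate of the first and is dropped.
    print(select_non_redundant(ranked))
```

Because the selection is greedy over a ranked list, the representative of each group is always its highest-ranked sentence; a different ranking or similarity measure would change which sentences survive.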
