Multidocument Arabic Text Summarization Based on Clustering and Word2Vec to Reduce Redundancy



Abstract

Arabic is one of the most semantically and syntactically complex languages in the world. Text summarization is a key challenge in text mining, so we propose an unsupervised score-based method that combines the vector space model, continuous bag of words (CBOW), clustering, and a statistically based method. Multidocument text summarization suffers from noisy data, redundancy, diminished readability, and sentence incoherency. In this study, we adopt a preprocessing strategy to address the noise problem and use the word2vec model for two purposes: first, to map the words to fixed-length vectors and, second, to obtain the semantic relationship between each vector based on the dimensions. Similarly, we use the k-means algorithm for two purposes: (1) selecting the distinctive documents and tokenizing these documents into sentences, and (2) using another iteration of the k-means algorithm to select the key sentences based on the similarity metric, which overcomes the redundancy problem and generates the initial summary. Lastly, we use weighted principal component analysis (W-PCA) to map the sentences' encoded weights based on a list of features. This selects the highest set of weights, which corresponds to the important sentences, to address the incoherency and readability problems. We adopted Recall-Oriented Understudy for Gisting Evaluation (ROUGE) as the evaluation measure to examine our proposed technique and compare it with state-of-the-art methods. Finally, an experiment on the Essex Arabic Summaries Corpus (EASC) using the ROUGE-1 and ROUGE-2 metrics showed promising results in comparison with existing methods.
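
The sentence-selection step described in the abstract (cluster the sentences, then keep one representative per cluster so that near-duplicates are dropped) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that a sentence is represented by the average of its word vectors, that cosine similarity is the similarity metric, and that k-means is seeded with the first k sentences so the run is deterministic. The names `toy_embeddings`, `toy_sentences`, and `select_key_sentences` are hypothetical, and the two-dimensional English toy vectors merely stand in for a trained word2vec CBOW model over Arabic text. The document-level clustering and the W-PCA ranking are omitted.

```python
import math

def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def sentence_vector(tokens, embeddings):
    """Average the embeddings of known tokens (assumed sentence encoding)."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    return mean_vector(known) if known else None

def kmeans(points, k, iterations=20):
    """Plain k-means under cosine similarity, seeded with the first k points."""
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its most similar centroid.
        assignment = [
            max(range(k), key=lambda c: cosine(p, centroids[c])) for p in points
        ]
        # Move each centroid to the mean of its members (empty clusters stay put).
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = mean_vector(members)
    return centroids, assignment

def select_key_sentences(sentences, embeddings, k):
    """Return indices of one representative sentence per cluster, in text order."""
    indexed = [(i, sentence_vector(s, embeddings)) for i, s in enumerate(sentences)]
    indexed = [(i, v) for i, v in indexed if v is not None]
    points = [v for _, v in indexed]
    k = min(k, len(points))
    if k <= 0:
        return []
    centroids, assignment = kmeans(points, k)
    chosen = []
    for c in range(k):
        members = [j for j, a in enumerate(assignment) if a == c]
        if members:
            # Keep the member closest to the centroid; the rest are treated as redundant.
            best = max(members, key=lambda j: cosine(points[j], centroids[c]))
            chosen.append(indexed[best][0])
    return sorted(chosen)

# Hypothetical toy data: two topics in a 2-D embedding space.
toy_embeddings = {
    "economy": [1.0, 0.0], "market": [0.9, 0.1], "bank": [0.8, 0.2],
    "football": [0.0, 1.0], "match": [0.1, 0.9], "goal": [0.2, 0.8],
}
toy_sentences = [
    ["economy", "market"],
    ["football", "match"],
    ["market", "bank"],
    ["match", "goal"],
    ["economy", "bank"],
]

summary_ids = select_key_sentences(toy_sentences, toy_embeddings, k=2)
print(summary_ids)
```

With k set to the number of topics expected in the summary, each cluster contributes a single sentence, which is how clustering reduces redundancy in this sketch. In practice the vectors would come from a word2vec model trained on a preprocessed Arabic corpus, and a library k-means with better seeding would likely replace the hand-written loop.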
