Conference: International Conference on Information and Communications Technology

Multi document summarization for the Indonesian language based on latent dirichlet allocation and significance sentence


Abstract

Automatic multi-document summarization for the Indonesian language can help readers get a more comprehensive view of online news. Latent Dirichlet Allocation (LDA) is a clustering algorithm that has been widely developed for text data for over a decade, and it performs well in text classification and information retrieval. LDA can also be used for document summarization, since it is able to capture the framework of a document. Multi-document summarization for Indonesian using unsupervised learning approaches, especially LDA, is still limited. The LDA and Significance Sentence methods have the advantage of selecting representative sentences from the source documents. The model was tested using combinations of alpha values 0.1 and 0.001 and beta values 0.001 and 0.1, together with compression rates of 10%, 30%, and 50% in the sentence ranking process for each document. The best result was obtained with the following parameter combination: an alpha of 0.01, a beta of 0.1, and a compression rate of 50%, giving a cosine similarity of 0.931.
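
The abstract does not say how sentence scores are derived from the LDA topics, nor which vectors the cosine similarity is computed over. As a rough illustration only, the sketch below assumes each sentence already has a numeric significance score, treats the compression rate as the fraction of sentences kept, and compares texts using simple term-frequency vectors. The function names `select_sentences` and `cosine_similarity` are hypothetical and do not come from the paper.

```python
import math
from collections import Counter


def select_sentences(sentences, scores, compression_rate):
    """Keep the top-scoring fraction of sentences, preserving source order.

    Assumption: `scores` holds one precomputed significance score per
    sentence; the paper's actual scoring method is not given in the abstract.
    """
    keep = max(1, round(len(sentences) * compression_rate))
    # Rank sentence indices by descending score, then restore document order.
    ranked = sorted(range(len(sentences)), key=lambda i: -scores[i])
    chosen = sorted(ranked[:keep])
    return [sentences[i] for i in chosen]


def cosine_similarity(text_a, text_b):
    """Cosine similarity between whitespace-tokenized term-frequency vectors.

    Assumption: plain term counts stand in for whatever representation
    the authors used to compare summaries.
    """
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

With a compression rate of 0.5, for example, `select_sentences` would keep half of the input sentences, and `cosine_similarity` could then compare the joined summary against a reference text.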
