Journal: Multimedia Tools and Applications
A novel ensemble statistical topic extraction method for scientific publications based on optimization clustering

Abstract

Automatic topic extraction (TE) from scientific publications provides a very compact summary of the clusters' contents, which often makes information easier to locate. TE also enables us to define the boundaries of scientific fields. Text Document Clustering (TDC) is generally the first step of topic identification: it identifies the documents that address a related subject matter. Metaheuristics are typically used as efficient approaches for TDC. The multi-verse optimizer (MVO) is a stochastic population-based algorithm that was recently proposed and has been successfully applied to many hard optimization problems. In the TE process, each statistical TE method focuses on different aspects of the language feature space. The aim of this paper is to design a novel ensemble method for automatic TE from a collection of scientific publications, with MVO as the clustering algorithm. The automatic TE methods used in our approach are term frequency-inverse document frequency (TF-IDF), most-frequent-based keyword extraction (TF), co-occurrence statistical information-based keyword extraction (CSI), TextRank (TR), and mutual information (MI). Each automatic TE method provides a group of candidate topics to the proposed ensemble method. The ensemble approach then prunes the candidate topic set by applying a specific filtering heuristic and recalculates the candidates' scores based on prescribed metrics. Finally, dynamic threshold functions are applied to select a set of topics for particular scientific publications. The findings emphasize the efficiency and effectiveness of the refined candidate set. The results also show that the new topics improve the system's quality. The proposed method achieved better precision and recall than state-of-the-art TE methods on a similar dataset.
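The ensemble stage described in the abstract (candidate generation, pruning, rescoring, thresholding) can be illustrated with a minimal Python sketch. The abstract does not specify the filtering heuristic, the rescoring metrics, or the dynamic threshold functions, so everything below is an assumption made for illustration only: the function names are hypothetical, pruning drops stopwords and very short tokens, rescoring uses Borda-style rank points, and the threshold is the mean rescored value. Only two of the five extractors (TF and TF-IDF) are sketched, and the MVO clustering step is omitted entirely.

```python
import math
from collections import Counter


def tf_ranking(tokens, k=10):
    """Rank terms by raw frequency in one document (the TF extractor)."""
    counts = Counter(tokens)
    ranked = sorted(counts, key=lambda t: (-counts[t], t))
    return ranked[:k]


def tfidf_ranking(tokens, corpus, k=10):
    """Rank terms by TF-IDF; corpus is a list of token lists.

    Uses a smoothed IDF, log((1 + N) / (1 + df)), which is one common
    variant and not necessarily the one used in the paper.
    """
    counts = Counter(tokens)
    n_docs = len(corpus)
    doc_sets = [set(doc) for doc in corpus]
    scores = {}
    for term, count in counts.items():
        df = sum(1 for doc in doc_sets if term in doc)
        idf = math.log((1 + n_docs) / (1 + df))
        scores[term] = (count / len(tokens)) * idf
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]


def ensemble_topics(ranked_lists, stopwords=frozenset()):
    """Combine ranked candidate lists from several extractors.

    All three steps are illustrative assumptions, not the paper's method.
    """
    # Pruning (assumed heuristic): drop stopwords and tokens under 3 chars.
    pruned = [
        [t for t in lst if t not in stopwords and len(t) > 2]
        for lst in ranked_lists
    ]
    # Rescoring (assumed metric): Borda-style points, top rank scores most.
    scores = Counter()
    for lst in pruned:
        for rank, term in enumerate(lst):
            scores[term] += len(lst) - rank
    if not scores:
        return []
    # Dynamic threshold (assumed): mean rescored value for this document.
    threshold = sum(scores.values()) / len(scores)
    selected = [t for t, s in scores.items() if s >= threshold]
    return sorted(selected, key=lambda t: (-scores[t], t))


if __name__ == "__main__":
    candidates = [
        ["clustering", "topic", "the", "mvo"],
        ["topic", "clustering", "optimizer"],
        ["topic", "ensemble", "clustering"],
    ]
    print(ensemble_topics(candidates, stopwords=frozenset({"the"})))
```

Because the threshold is derived from each document's own score distribution, the number of selected topics varies per document rather than being fixed in advance, which is the general idea behind a dynamic threshold.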