A Method of Topic Detection for Great Volume of Data

Abstract

Topic extraction has become increasingly important because of its effectiveness in many tasks, including information filtering, information retrieval, and the organization of document collections in digital libraries. Topic detection consists of finding the most significant topics within a document corpus. In this paper we explore the adoption of a feature reduction methodology to highlight the most significant topics within a document corpus. We use an approach based on a clustering algorithm (X-means) applied to the tf-idf matrix computed from the corpus, which describes the frequency of the terms, represented by the columns, that occur in each document, represented by a row. To extract the topics, we build n binary problems, where n is the number of clusters produced by the unsupervised clustering step, and we perform a supervised feature selection over each of them, taking the top features as the topic descriptors. We show the results obtained on two different corpora. Both collections are in Italian: the first consists of documents of the University of Naples Federico II, and the second consists of a collection of medical records.
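
The pipeline described in the abstract (tf-idf matrix, clustering, one binary problem per cluster, top-ranked features as topic descriptors) can be illustrated with a minimal, standard-library-only sketch. Several parts are assumptions and not the authors' implementation: plain k-means with a fixed `k` stands in for X-means (which estimates the number of clusters itself), the idf formula is one common smoothed variant, and the feature score (mean tf-idf inside the cluster minus mean tf-idf outside it) is a simple substitute for the supervised feature selection method, which the abstract does not name. All function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """Return (vocab, rows): one row per document, one column per term."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    df = Counter(t for toks in tokenized for t in set(toks))
    # Assumed idf variant: log(N / df) + 1.
    idf = {t: math.log(len(docs) / df[t]) + 1.0 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        length = max(len(toks), 1)  # guard against empty documents
        rows.append([counts[t] / length * idf[t] for t in vocab])
    return vocab, rows


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(rows, k, iters=20):
    """Plain k-means with deterministic farthest-point initialization.

    Stand-in for X-means: here k must be supplied by the caller.
    """
    centers = [rows[0]]
    while len(centers) < k:
        centers.append(max(rows, key=lambda r: min(sq_dist(r, c) for c in centers)))
    labels = [0] * len(rows)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: sq_dist(r, centers[j])) for r in rows]
        for j in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def topic_descriptors(docs, k, top=3):
    """Cluster the documents, then rank terms for each 'cluster vs. rest' problem."""
    vocab, rows = tfidf_matrix(docs)
    labels = kmeans(rows, k)
    topics = {}
    for j in range(k):
        inside = [r for r, lab in zip(rows, labels) if lab == j]
        outside = [r for r, lab in zip(rows, labels) if lab != j]
        if not inside:
            continue
        scores = []
        for col, term in enumerate(vocab):
            mean_in = sum(r[col] for r in inside) / len(inside)
            mean_out = sum(r[col] for r in outside) / len(outside) if outside else 0.0
            # Assumed score: how much more the term weighs inside the cluster.
            scores.append((mean_in - mean_out, term))
        scores.sort(key=lambda s: (-s[0], s[1]))
        topics[j] = [term for _, term in scores[:top]]
    return labels, topics


if __name__ == "__main__":
    toy_docs = [
        "exam course student exam",
        "student course lecture exam",
        "patient therapy diagnosis patient",
        "diagnosis patient therapy hospital",
    ]
    labels, topics = topic_descriptors(toy_docs, k=2)
    print(labels)
    print(topics)
```

The toy corpus is invented for illustration only. On a real corpus, the clustering step and the term-scoring step would each be replaced by the methods the paper actually uses.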