Journal of Biomedical Semantics

Thematic clustering of text documents using an EM-based approach

Abstract

Clustering textual content is an important step in mining useful information from the web or other text-based resources. The common task in text clustering is to handle text in a multi-dimensional space and to partition documents into groups, where each group contains documents that are similar to each other. However, this strategy does not give human readers a comprehensive view, since it cannot explain the main subject of each cluster. Utilizing semantic information can solve this problem, but it requires a well-defined ontology or a pre-labeled gold standard set. In this paper, we present a thematic clustering algorithm for text documents. Given text, subject terms are extracted and used for clustering documents in a probabilistic framework. An EM approach is used to assign documents to the correct subjects, and it converges to a locally optimal solution. The proposed method is distinctive because its results are explanatory enough for human understanding as well as efficient in terms of clustering performance. The experimental results show that the proposed method performs competitively with other state-of-the-art approaches. We also show that the themes extracted from the MEDLINE® dataset represent the subjects of clusters reasonably well.
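The abstract does not give the model's details, so the following is only a minimal sketch of the general idea of EM-based document clustering, not the authors' algorithm. It assumes a mixture of multinomials over document terms; the function name `em_cluster`, the smoothing parameter `alpha`, the round-robin initialization, and the toy documents are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def em_cluster(docs, k, iters=30, alpha=0.1):
    """Soft-cluster tokenized documents with EM on a mixture of multinomials.

    Returns (labels, top_terms): the most probable cluster for each document
    and the three highest-probability terms of each cluster.
    This is an illustrative model, not the method described in the paper.
    """
    vocab = sorted({w for doc in docs for w in doc})
    counts = [Counter(doc) for doc in docs]

    # Assumed initialization: deterministic round-robin hard assignment,
    # chosen only so that the sketch is reproducible.
    resp = [[1.0 if i % k == c else 0.0 for c in range(k)]
            for i in range(len(docs))]

    for _ in range(iters):
        # M-step: re-estimate cluster priors and per-cluster term
        # distributions from the current responsibilities (alpha smooths).
        prior = [(sum(r[c] for r in resp) + alpha) / (len(docs) + k * alpha)
                 for c in range(k)]
        word_logp = []
        for c in range(k):
            mass = {w: alpha for w in vocab}
            for r, cnt in zip(resp, counts):
                for w, n in cnt.items():
                    mass[w] += r[c] * n
            total = sum(mass.values())
            word_logp.append({w: math.log(m / total) for w, m in mass.items()})

        # E-step: posterior probability of each cluster given the document,
        # computed in log space and normalized with the log-sum-exp trick.
        for i, cnt in enumerate(counts):
            logs = [math.log(prior[c])
                    + sum(n * word_logp[c][w] for w, n in cnt.items())
                    for c in range(k)]
            peak = max(logs)
            exps = [math.exp(v - peak) for v in logs]
            norm = sum(exps)
            resp[i] = [e / norm for e in exps]

    labels = [max(range(k), key=lambda c: r[c]) for r in resp]
    top_terms = [sorted(vocab, key=lambda w: -word_logp[c][w])[:3]
                 for c in range(k)]
    return labels, top_terms


# Toy documents (hypothetical): three about genetics, three about cardiology.
docs = [
    ["gene", "protein", "dna"],
    ["gene", "dna", "sequence"],
    ["protein", "gene", "sequence"],
    ["heart", "blood", "artery"],
    ["heart", "artery", "valve"],
    ["blood", "heart", "valve"],
]
labels, top_terms = em_cluster(docs, k=2)
print(labels)
print(top_terms)
```

The highest-probability terms of each cluster act as a rough stand-in for a human-readable theme. Like any EM procedure, the sketch only reaches a local optimum, so a different initialization can yield a different partition.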
