Conference: International Conference on Semantics, Knowledge and Grids

An Exploratory Study of Enhancing Text Clustering with Auto-Generated Semantic Tags

Abstract

With the exponentially growing volume of digital documents and internet content, locating the right information when it is needed has become very challenging. We rely heavily on search engines, but most existing search tools are keyword-based and often return results with low precision and recall. The emerging semantic tagging technology provides an automatic way to generate semantic tags from text, and it has drawn increasing interest from the text mining research community. It is critical to study how semantic tags can be used to improve text mining tasks such as clustering, which helps users search and browse documents more effectively. Unfortunately, most previous work on text clustering is based solely on content information. A few recent studies take user-generated tags into account; however, user-generated tags are often noisy, inconsistent, and redundant, and they lack semantic information and hierarchical structure. In this work, we propose a Semantic Text Mining (STeM) framework that generates semantic tags for given documents and then uses those tags to improve text clustering. Unlike previous work, we represent a document by a combination of domains and high-quality noun phrases. We investigate the performance of our methods on two different datasets and evaluate the results by normalized mutual information. Experimental results demonstrate that our proposed method substantially outperforms traditional clustering based on Term Frequency-Inverse Document Frequency (TF-IDF) term vectors. We find that incorporating semantic information into the document representation is critical to improving the performance of text clustering.
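To make the baseline representation and the evaluation metric named in the abstract concrete, here is a minimal standard-library sketch of TF-IDF term weighting and normalized mutual information (NMI). It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say which TF-IDF variant or which NMI normalization the paper used, so the relative-frequency TF, the unsmoothed `log(N/df)` IDF, and the arithmetic-mean normalization are assumptions, and the function names `tfidf_vectors` and `normalized_mutual_information` are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each tokenized document to a {term: weight} dict.

    Assumed variant: weight = (count / doc length) * log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def normalized_mutual_information(true_labels, cluster_labels):
    """NMI between reference classes and predicted clusters.

    Assumed normalization: arithmetic mean of the two entropies.
    """
    n = len(true_labels)
    true_counts = Counter(true_labels)
    cluster_counts = Counter(cluster_labels)
    joint_counts = Counter(zip(true_labels, cluster_labels))
    mutual_info = sum(
        (c / n) * math.log(c * n / (true_counts[t] * cluster_counts[k]))
        for (t, k), c in joint_counts.items()
    )
    normalizer = (entropy(true_labels) + entropy(cluster_labels)) / 2
    # Both labelings are a single group: treat them as trivially identical.
    if normalizer == 0:
        return 1.0
    return mutual_info / normalizer


if __name__ == "__main__":
    docs = [["semantic", "tag"], ["semantic", "cluster"]]
    print(tfidf_vectors(docs))
    # Cluster ids are arbitrary, so a relabeled but identical grouping scores as perfect.
    print(normalized_mutual_information([0, 0, 1, 1], [1, 1, 0, 0]))
```

NMI is invariant to how cluster ids are numbered, which is why it suits comparing a clustering against reference classes. Under this TF-IDF variant, a term that occurs in every document gets weight zero.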
