Conference: 2012 Eighth International Conference on Semantics, Knowledge and Grids

An Exploratory Study of Enhancing Text Clustering with Auto-Generated Semantic Tags

Abstract

With the exponentially growing volume of digital documents and internet content, it has become very challenging to locate the right information when it is needed. We rely heavily on search engines, but most existing search tools are keyword-based and often return results with low precision and recall. Emerging semantic tagging technology provides an automatic way to generate semantic tags from text, and it has drawn increasing interest from the text mining research community. It is critical to study how semantic tags can be used to improve text mining tasks such as clustering, which helps users search and browse documents more effectively. Unfortunately, most previous work on text clustering relies solely on content information. A few recent studies take user-generated tags into account; however, user-generated tags are often noisy, inconsistent, and redundant, and they lack semantic information and hierarchical structure. In this work, we propose a Semantic Text Mining (STeM) framework that generates semantic tags for given documents and then uses those tags to improve text clustering. Unlike previous work, we represent a document by a combination of domains and high-quality noun phrases. We investigate the performance of our methods on two different datasets and evaluate the results using normalized mutual information. Experimental results demonstrate that our proposed method substantially outperforms traditional clustering based on Term Frequency-Inverse Document Frequency (TF-IDF) term vectors. We find that incorporating semantic information into the document representation is critical to improving the performance of text clustering.
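
The abstract names normalized mutual information (NMI) as the evaluation measure but does not say which normalization was used. The following is a minimal sketch of how NMI between reference labels and cluster assignments might be computed, assuming the arithmetic-mean normalization; that choice and the function names `entropy`, `mutual_information`, and `nmi` are illustrative assumptions, not details taken from the paper.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in nats) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(true_labels, cluster_labels):
    """Mutual information (in nats) between two label assignments."""
    n = len(true_labels)
    joint = Counter(zip(true_labels, cluster_labels))
    true_counts = Counter(true_labels)
    cluster_counts = Counter(cluster_labels)
    mi = 0.0
    for (t, c), n_tc in joint.items():
        # p(t, c) * log( p(t, c) / (p(t) * p(c)) ), written with raw counts
        mi += (n_tc / n) * math.log(n_tc * n / (true_counts[t] * cluster_counts[c]))
    return mi


def nmi(true_labels, cluster_labels):
    """NMI normalized by the mean of the two entropies (an assumed choice)."""
    if len(true_labels) != len(cluster_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("label lists must not be empty")
    h_true = entropy(true_labels)
    h_cluster = entropy(cluster_labels)
    if h_true == 0.0 and h_cluster == 0.0:
        # Both partitions put every document in one group, so they agree.
        return 1.0
    return 2.0 * mutual_information(true_labels, cluster_labels) / (h_true + h_cluster)
```

Under this definition the score does not depend on how clusters are numbered: a clustering that reproduces the reference classes scores 1 even if the cluster ids are permuted, and a clustering that is statistically independent of the reference classes scores 0.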