An Exploratory Study of Enhancing Text Clustering with Auto-Generated Semantic Tags

Abstract

With the exponentially growing volume of digital documents and internet content, it has become very challenging to locate the right information when it is needed. We rely heavily on search engines, but most existing search tools are keyword-based and often return results with low precision and recall. Emerging semantic tagging technology provides an automatic way to generate semantic tags from text, and it has drawn increasing interest from the text mining research community. It is therefore important to study how semantic tags can improve text mining tasks such as clustering, which helps users search and browse documents more effectively. Unfortunately, most previous work on text clustering relies solely on content information. A few recent studies take user-generated tags into account; however, user-generated tags are often noisy, inconsistent, and redundant, and they lack semantic information and hierarchical structure. In this work, we propose a Semantic Text Mining (STeM) framework that generates semantic tags for given documents and then uses those tags to improve text clustering. Unlike previous work, we represent a document by a combination of domains and high-quality noun phrases. We investigate the performance of our methods on two different datasets and evaluate the results by normalized mutual information (NMI). Experimental results demonstrate that our proposed method substantially outperforms traditional clustering based on Term Frequency-Inverse Document Frequency (TF-IDF) term vectors. We find that incorporating semantic information into the document representation is critical to improving the performance of text clustering.
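The abstract names three building blocks without giving their details: a TF-IDF term-vector baseline, a document representation enriched with semantic tags, and evaluation by normalized mutual information. The following is a minimal Python sketch of those pieces under stated assumptions, not the authors' STeM implementation. The function names (`tfidf_vectors`, `add_tag_features`, `nmi`), the `tag:` prefix, the `weight` parameter, and the choice to normalize mutual information by the arithmetic mean of the two entropies are all illustrative assumptions; the abstract does not say how tags are weighted or which NMI variant was used.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per tokenized document.

    Uses relative term frequency times log(N / document frequency).
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        term_freq = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return vectors


def add_tag_features(vector, tags, weight=1.0):
    """Return a copy of `vector` with semantic tags added as extra dimensions.

    The "tag:" prefix and the fixed weight are assumptions made for this
    sketch; they keep tag dimensions separate from ordinary terms.
    """
    combined = dict(vector)
    for tag in tags:
        key = "tag:" + tag
        combined[key] = combined.get(key, 0.0) + weight
    return combined


def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def nmi(labels_true, labels_pred):
    """Normalized mutual information between two label assignments.

    Assumption: mutual information is divided by the arithmetic mean of
    the two entropies. Other normalizations exist.
    """
    n = len(labels_true)
    joint = Counter(zip(labels_true, labels_pred))
    count_true = Counter(labels_true)
    count_pred = Counter(labels_pred)
    mutual_info = sum(
        (c / n) * math.log(c * n / (count_true[a] * count_pred[b]))
        for (a, b), c in joint.items()
    )
    denom = (entropy(labels_true) + entropy(labels_pred)) / 2
    # Both assignments place everything in one cluster: treat as identical.
    return mutual_info / denom if denom > 0 else 1.0
```

Any clustering algorithm that accepts sparse vectors could be run once on the plain TF-IDF vectors and once on the tag-augmented vectors, with `nmi` comparing each result against the reference categories. NMI does not depend on how clusters are numbered, so a clustering that matches the reference partition scores close to 1 even if its cluster IDs differ.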

Bibliographic details

Source: 2012 Eighth International Conference on Semantics, Knowledge and Grids, 2012, pp. 104-111 (8 pages)
Conference location: Beijing (CN)
Authors: Tang Xuning; Dang Jiangbo
Original format: PDF
Language: English
Chinese Library Classification: Programming

Similar literature

1. Enhanced cross-domain document clustering with a semantically enhanced text stemmer (SETS) [J]. Ivan Stankov, Diman Todorov, Rossitza Setchi. International journal of knowledge-based and intelligent engineering systems. 2013, No. 2

2. An Exploratory Study on the Policy for Facilitating of Health Behaviors Related to Particulate Matter: Using Topic and Semantic Network Analysis of Media Text [J]. Hye Min Byun, You Jin Park, Eun Kyoung Yun. Journal of Korean Academy of Nursing. 2021, No. 1

3. Extract the Semantic Meaning of Prepositions at Arabic Texts: An Exploratory Study [J]. Mohammad Khaled A. Al-Maghasbeh, Mohd Pouzi Bin Hamzah. International Journal of Computer Trends and Technology. 2015, No. 3

4. An Exploratory Study of Enhancing Text Clustering with Auto-Generated Semantic Tags [C]. Xuning Tang, Jiangbo Dang. 2012 Eighth International Conference on Semantics, Knowledge and Grids. 2012

5. Semantic preserving text representation and its applications in text clustering [D]. Howard, Michael. 2012

6. ‘MATRI-SUMAN’ a capacity building and text messaging intervention to enhance maternal and child health service utilization among pregnant women from rural Nepal: study protocol for a cluster randomised controlled trial [O]. Jitendra Kumar Singh, Rajendra Kadel, Dilaram Acharya. 2018

7. Arabic Text Summarization Based on Latent Semantic Analysis to Enhance Arabic Documents Clustering [O]. Hanane Froud, Abdelmonaime Lachkar, Said Alaoui Ouatik. 2013
