Query-relevant document representation for text clustering

Abstract

In text categorization, one well-known document representation is bag-of-words. Although it is simple and popular, it ignores semantics, underlying linguistic information, and word correlations. This paper proposes a new representation for text data called Bag-Of-Queries (BOQ). First, a taxonomy of the terms in the local vocabulary is extracted by learning term dependencies with an information-theoretic inclusion index. Next, the taxonomy is partitioned to generate sets of correlated terms, the bag of queries. Since any two partitions belong to different concepts, they are treated as semantically orthogonal queries. This provides a new space of orthogonal features, which is necessary for efficient categorization. Finally, instead of using terms directly as features, the terms are used to build a set of queries. Documents are ranked in response to each query using a similarity measure, and the resulting similarity indices serve as the new features in a vector space model representation. The proposed approach outperforms bag-of-words-based clustering. It also extracts new non-redundant features while reducing dimensionality.
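The final step described in the abstract, turning query-document similarities into features, can be illustrated with a minimal sketch. It is not the paper's implementation: the queries below are hand-picked hypothetical term groups standing in for the taxonomy partitions, cosine similarity over raw term counts is an assumed choice of similarity measure, and the function names `cosine` and `boq_features` are made up for illustration.

```python
import math
from collections import Counter


def cosine(doc_counts, query_terms):
    """Cosine similarity between a document's term counts and a binary query vector."""
    # Counter returns 0 for terms the document does not contain.
    dot = sum(doc_counts[t] for t in query_terms)
    doc_norm = math.sqrt(sum(c * c for c in doc_counts.values()))
    query_norm = math.sqrt(len(query_terms))
    if doc_norm == 0 or query_norm == 0:
        return 0.0
    return dot / (doc_norm * query_norm)


def boq_features(documents, queries):
    """Represent each document by its similarity to every query (one feature per query)."""
    features = []
    for text in documents:
        counts = Counter(text.lower().split())
        features.append([cosine(counts, q) for q in queries])
    return features


# Hypothetical queries: in the paper these would come from partitioning the
# learned term taxonomy; here they are chosen by hand.
queries = [
    {"goal", "match", "team"},
    {"stock", "market", "price"},
]
documents = [
    "the team scored a late goal in the match",
    "the stock price fell as the market opened",
]

features = boq_features(documents, queries)
for row in features:
    print([round(x, 3) for x in row])
```

Each document ends up with as many features as there are queries rather than one per vocabulary term, which is the dimensionality reduction the abstract refers to. The resulting rows could then be passed to any standard clustering algorithm.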

Bibliographic record

Source
Fifth International Conference on Digital Information Management | 2010 | pp. 132-138 | 7 pages
Author
Makrehchi Masoud
Original format: PDF
Chinese Library Classification: Artificial intelligence theory
Indexed: 2022-08-26 15:01:20

Similar literature

1. An improved ant algorithm with LDA-based representation for text document clustering [J]. Aytug Onan, Hasan Bulut, Serdar Korukoglu. Journal of Information Science, 2017, No. 2.
2. Graph based text representation for document clustering [J]. Asma Khazaal Abdulsahib, Siti Sakira Kamaruddin. Journal of Theoretical and Applied Information Technology, 2015, No. 1.
3. Representing text documents in training document spaces: a novel model for document representation [J]. Asmaa Mountassir, Houda Benbrahim, Ilham Berrada. Journal of Theoretical and Applied Information Technology, 2013, No. 1.
4. Query-relevant document representation for text clustering [C]. Makrehchi Masoud. International Conference on Digital Information Management, 2010.
5. Text document topical recursive clustering and automatic labeling of a hierarchy of document clusters [D]. Li, Xiaoxiao. 2012.
6. Assessing the Representation of Occupation Information in Free-Text Clinical Documents Across Multiple Sources [O]. Elizabeth A. Lindemann, Elizabeth S. Chen, Sripriya Rajamani.
7. Document Clustering and Distributed Representation In E-commerce Text Analysis [O]. 蔡越. 2016.
