Conference: Fifth International Conference on Digital Information Management
Query-relevant document representation for text clustering

Abstract

In text categorization, a well-known document representation is bag-of-words. Although simple and popular, it ignores semantics, underlying linguistic information, and word correlations. This paper proposes a new representation for text data called Bag-Of-Queries (BOQ). First, a taxonomy of the terms in the local vocabulary is extracted by learning term dependencies with an information-theoretic inclusion index. Next, the taxonomy is partitioned to generate sets of correlated terms, which form the bag of queries. Since any two partitions belong to different concepts, they are treated as semantically orthogonal queries. This provides a new space of orthogonal features, which is necessary for efficient categorization. Finally, instead of using terms as features, we use them to build a set of queries. Documents are ranked in response to the queries using a similarity measure, and the resulting similarity indices serve as new features in a vector space model representation. The proposed approach outperforms bag-of-words-based clustering. It also extracts new non-redundant features while reducing dimensionality.
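
The final step described above, scoring each document against a set of term-group queries and using the scores as features, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract names neither the similarity measure nor the partitioning procedure, so cosine similarity over raw term counts is assumed here, and the two term groups in `queries` are hand-written, hypothetical stand-ins for taxonomy partitions. The function names `cosine` and `boq_features` are likewise illustrative.

```python
import math
from collections import Counter


def cosine(doc_counts, query_terms):
    """Cosine similarity between a term-count vector and a binary query vector.

    Assumption: the abstract does not say which similarity measure is used;
    cosine similarity is chosen here only for illustration.
    """
    dot = sum(doc_counts.get(term, 0) for term in query_terms)
    doc_norm = math.sqrt(sum(c * c for c in doc_counts.values()))
    query_norm = math.sqrt(len(query_terms))
    if doc_norm == 0 or query_norm == 0:
        return 0.0
    return dot / (doc_norm * query_norm)


def boq_features(documents, queries):
    """Map each document to one similarity score per query (one feature per query)."""
    features = []
    for text in documents:
        counts = Counter(text.lower().split())
        features.append([cosine(counts, q) for q in queries])
    return features


# Hypothetical term groups standing in for taxonomy partitions;
# the taxonomy extraction step itself is not implemented in this sketch.
queries = [
    {"goal", "match", "team"},
    {"stock", "market", "bank"},
]
documents = [
    "the team scored a late goal in the match",
    "the bank warned that the stock market may fall",
]

features = boq_features(documents, queries)
for row in features:
    print([round(value, 3) for value in row])
```

Each document is reduced to a vector with one entry per query, so the feature space has as many dimensions as there are term groups rather than one per vocabulary term. The resulting rows could then be passed to any standard clustering routine.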
