Confronting Sparseness and High Dimensionality in Short Text Clustering via Feature Vector Projections

Abstract

Short text clustering is a popular problem that focuses on the unsupervised grouping of similar short text documents, or entitled entities. Because short texts are now used in a vast number of applications, the problem has become increasingly significant in the past few years. High cluster homogeneity and completeness are two of the most important goals of any data clustering algorithm. In the context of short texts, however, they are particularly difficult to achieve, because this type of data is typically represented by sparse vectors that collectively span a very high-dimensional space. In this article we introduce VEPHC, a two-stage clustering algorithm designed to confront the sparseness and high dimensionality of short texts. In the first stage (the VEP part), the initial feature vectors are projected onto a lower-dimensional space by constructing and scoring variable-sized combinations of features (that is, terms). In the second stage (the HC part), VEPHC improves the homogeneity and completeness of the generated clusters through split and merge operations that are based on the similarities of all inter-cluster elements. The experimental evaluation of VEPHC on two real-world datasets demonstrates superior performance over numerous state-of-the-art clustering algorithms in terms of F1 scores and Normalized Mutual Information.
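The abstract names homogeneity, completeness, and Normalized Mutual Information as quality measures, but this record does not define them. The following is a minimal sketch using the common entropy-based definitions; it assumes, without confirmation from the source, that the paper uses these forms. The function names, the arithmetic-mean normalization for NMI, and the toy labels are all illustrative and are not taken from VEPHC.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def conditional_entropy(target, given):
    """H(target | given) for two aligned label sequences."""
    n = len(target)
    joint = Counter(zip(given, target))
    given_counts = Counter(given)
    return -sum((c / n) * log(c / given_counts[g]) for (g, _), c in joint.items())


def homogeneity(classes, clusters):
    """1 - H(C|K)/H(C): high when each cluster holds a single class."""
    h_c = entropy(classes)
    return 1.0 if h_c == 0 else 1.0 - conditional_entropy(classes, clusters) / h_c


def completeness(classes, clusters):
    """1 - H(K|C)/H(K): high when each class sits in a single cluster."""
    h_k = entropy(clusters)
    return 1.0 if h_k == 0 else 1.0 - conditional_entropy(clusters, classes) / h_k


def nmi(classes, clusters):
    """Mutual information divided by the arithmetic mean of both entropies
    (an assumed normalization; other variants exist)."""
    h_c, h_k = entropy(classes), entropy(clusters)
    if h_c == 0 and h_k == 0:
        return 1.0
    mutual_info = h_c - conditional_entropy(classes, clusters)
    return 2.0 * mutual_info / (h_c + h_k)


if __name__ == "__main__":
    # Hypothetical ground-truth classes for six short texts.
    truth = ["a", "a", "a", "b", "b", "b"]
    over_split = [0, 0, 1, 2, 2, 2]   # class "a" is split across two clusters
    over_merged = [0, 0, 0, 0, 0, 0]  # everything merged into one cluster
    for name, pred in [("over-split", over_split), ("over-merged", over_merged)]:
        print(name, homogeneity(truth, pred), completeness(truth, pred), nmi(truth, pred))
```

On the toy labels, the over-split clustering is fully homogeneous but not complete, while the over-merged clustering is complete but not homogeneous. This is the tension that split and merge operations such as those described for the HC stage would have to balance.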

Bibliographic record

Source
IEEE International Conference on Tools with Artificial Intelligence | 2020 | pp. 813-820 | 8 pages
Authors
Leonidas Akritidis; Miltiadis Alamaniotis; Athanasios Fevgas; Panayiotis Bozanis
Original format: PDF
Keywords
Conferences; Clustering methods; Merging; Clustering algorithms; Tools; Artificial intelligence; Mutual information

Similar literature

1. Spectral clustering of high-dimensional data exploiting sparse representation vectors [J]. Sen Wu, Xiaodong Feng, Wenjun Zhou. Neurocomputing, 2014, Issue jula5.
2. Sparse two-dimensional discriminant locality-preserving projection (S2DDLPP) for feature extraction [J]. Wan Minghua, Yang Guowei, Sun Chengli. Soft Computing: A Fusion of Foundations, Methodologies and Applications, 2019, Issue 14.
3. Dimensionality Reduction of Sparse Visual Features via Recoverable Projection for Large-Scale Image Retrieval [J]. Zaixing HE, Takahiro OGAWA, Miki HASEYAMA. 電子情報通信学会技術研究報告, 2012, Issue 442.
4. The Clustering-Based Initialization for Non-negative Matrix Factorization in the Feature Transformation of the High-Dimensional Text Categorization System: A Viewpoint of Term Vectors [C]. Le Nguyen Hoai Nam, Ho Bao Quoc. International Conference on Theory and Practice of Digital Libraries, 2017.
5. Semi-supervised clustering for high-dimensional and sparse features [D]. Yan, Su. 2010.
6. 2D–EM clustering approach for high-dimensional data through folding feature vectors [O]. Alok Sharma, Piotr J. Kamola, Tatsuhiko Tsunoda. 2017.
7. Compressing Sparse Feature Vectors using Random Ortho-Projections [O]. Esa Rahtu, Mikko Salo, Janne Heikkilä. 2013.
