Scalable k-NN based text clustering

Abstract

Clustering items using textual features is an important problem with many applications, such as root-cause analysis of spam campaigns, as well as identifying common topics in social media. Due to the sheer size of such data, algorithmic scalability becomes a major concern. In this work, we present our approach for text clustering that builds an approximate k-NN graph, which is then used to compute connected components representing clusters. Our focus is to understand the scalability / accuracy tradeoff that underlies our method: we do so through an extensive experimental campaign, where we use real-life datasets, and show that even rough approximations of k-NN graphs are sufficient to identify valid clusters. Our method is scalable and can be easily tuned to meet requirements stemming from different application domains.
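The pipeline the abstract describes (build a k-NN graph over the texts, then read clusters off its connected components) can be sketched as follows. This is a minimal illustration, not the authors' implementation. It builds an exact, brute-force k-NN graph where the paper uses an approximate one. Jaccard similarity over whitespace tokens, the `min_sim` edge cutoff, and all function names are assumptions made for the example.

```python
def jaccard(a, b):
    """Jaccard similarity between two token sets (assumed similarity measure)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def knn_graph(docs, k, min_sim):
    """Brute-force k-NN graph, a stand-in for an approximate graph builder.

    Each document is linked to its k most similar documents; links weaker
    than min_sim (a hypothetical cutoff) are dropped.
    """
    tokens = [set(d.lower().split()) for d in docs]
    edges = set()
    for i, ti in enumerate(tokens):
        scored = [(jaccard(ti, tj), j) for j, tj in enumerate(tokens) if j != i]
        scored.sort(key=lambda s: (-s[0], s[1]))  # most similar first
        for sim, j in scored[:k]:
            if sim >= min_sim:
                edges.add((min(i, j), max(i, j)))  # store as undirected edge
    return edges


def connected_components(n, edges):
    """Group nodes 0..n-1 into connected components with union-find."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri

    groups = {}
    for node in range(n):
        groups.setdefault(find(node), []).append(node)
    return sorted(groups.values())


def cluster_texts(docs, k=2, min_sim=0.3):
    """Cluster documents: k-NN graph first, connected components second."""
    return connected_components(len(docs), knn_graph(docs, k, min_sim))


docs = [
    "win a free prize now",
    "win a free prize today",
    "free prize win now",
    "meeting moved to friday",
    "meeting moved to monday",
]
print(cluster_texts(docs))  # [[0, 1, 2], [3, 4]]
```

In this sketch, `k` and `min_sim` are the tuning knobs: a smaller `k` or a higher `min_sim` yields a sparser graph and therefore more, smaller clusters. The brute-force neighbour search is quadratic in the number of documents, and that cost is what an approximate k-NN graph is meant to avoid.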


Bibliographic details

Source
IEEE International Congress on Big Data | 2015 | 6 pages
Authors
Lulli Alessandro; Debatty Thibault; Dell'Amico Matteo; Michiardi Pietro; Ricci Laura
Original format: PDF
Chinese Library Classification: Programming, software engineering

