Journal: Computing

A parallel text clustering method using Spark and hashing


Abstract

Clustering textual data has become an important task in data analytics, since several applications require automatically organizing large amounts of textual documents into homogeneous topics. The rapid growth of textual data available from the web, social networks, and open platforms has made this task more challenging. It has become important to design scalable clustering methods that can effectively organize huge amounts of textual data into topics. In this context, we propose a new parallel text clustering method based on the Spark framework and hashing. The proposed method addresses both the problem of clustering a huge number of documents and the problem of the high dimensionality of textual data, by integrating a divide-and-conquer approach and implementing a new document hashing strategy, respectively. These two design choices yield a significant improvement in scalability and a good approximation of clustering quality. Experiments performed on several large collections of documents show the effectiveness of the proposed method compared to existing ones in terms of running time and clustering accuracy.
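The abstract does not describe the hashing strategy or the Spark pipeline in detail, so the following is only a minimal single-machine sketch of the two general ideas it names: the hashing trick to bound dimensionality, and divide and conquer to cluster partitions independently before merging. The function names (`hash_document`, `kmeans`, `divide_and_conquer_cluster`), the use of CRC32, the k-means seeding, and all parameters are assumptions made for illustration, not the authors' method.

```python
import zlib
from collections import Counter

def hash_document(text, dim=16):
    """Map a document to a fixed-length vector with the hashing trick (assumed scheme)."""
    vec = [0.0] * dim
    for token, count in Counter(text.lower().split()).items():
        # crc32 is stable across runs, unlike Python's built-in hash()
        vec[zlib.crc32(token.encode("utf-8")) % dim] += count
    return vec

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def nearest(v, centroids):
    """Index of the centroid closest to v."""
    return min(range(len(centroids)), key=lambda i: sq_dist(v, centroids[i]))

def kmeans(vectors, k, iters=10):
    """Plain Lloyd's k-means, seeded with the first k vectors (illustrative only)."""
    centroids = [list(v) for v in vectors[:k]]
    for _ in range(iters):
        groups = [[] for _ in centroids]
        for v in vectors:
            groups[nearest(v, centroids)].append(v)
        # Keep the old centroid if a group ends up empty
        centroids = [mean(g) if g else c for g, c in zip(groups, centroids)]
    return centroids

def divide_and_conquer_cluster(docs, k, n_parts=2, dim=16):
    """Hash documents, cluster each partition, then cluster the local centroids."""
    vectors = [hash_document(d, dim) for d in docs]
    # Divide: each slice stands in for one Spark partition
    parts = [vectors[i::n_parts] for i in range(n_parts)]
    # Conquer: cluster each partition independently (parallel on a real cluster)
    local = [c for p in parts if p for c in kmeans(p, min(k, len(p)))]
    # Combine: cluster the local centroids into at most k final centroids
    final = kmeans(local, min(k, len(local)))
    return [nearest(v, final) for v in vectors]

docs = ["spark hashing cluster", "spark hashing cluster",
        "football match score", "football match score"]
labels = divide_and_conquer_cluster(docs, k=2)
```

On a real Spark deployment the per-partition step would presumably run through something like `mapPartitions`, with only the local centroids collected to the driver; that mapping is also an assumption here, since the abstract does not state it.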
