Conference proceedings: Artificial Intelligence Applications and Innovations

CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and MapReduce

Abstract

As the Internet develops rapidly, huge amounts of text need to be processed in a short time. This calls for fast, scalable text-processing methods. This paper proposes a method for computing pairwise text similarity on massive data sets, using the cosine similarity metric and the tf-idf (term frequency-inverse document frequency) normalization method. The approach centers on the MapReduce paradigm, a model for processing large data sets in parallel with a distributed algorithm on computer clusters. By applying the MapReduce model at each step of the proposed method, text-processing speed and scalability are improved relative to traditional methods. The CSMR (Cosine Similarity with MapReduce) method is currently at the implementation stage. Precise, analytical conclusions about the efficiency of the proposed method will be drawn once all project phases have been completed and reviewed.
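
To make the pipeline described in the abstract concrete, the following is a minimal single-process sketch of tf-idf weighting and pairwise cosine similarity arranged as map, shuffle, and reduce steps. It is not the authors' CSMR implementation: the function names, whitespace tokenization, the idf formula log(N / df), and the in-memory `shuffle` helper standing in for a real MapReduce framework are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def map_term_counts(doc_id, text):
    """Map step: emit (term, (doc_id, count)) for each distinct term in a document."""
    counts = Counter(text.lower().split())  # assumption: naive whitespace tokenization
    for term, count in counts.items():
        yield term, (doc_id, count)


def shuffle(pairs):
    """Group mapper output by key; stands in for a framework's shuffle phase."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_tfidf(postings, n_docs):
    """Reduce step: weight each (doc_id, count) by idf = log(N / df) (assumed formula)."""
    idf = math.log(n_docs / len(postings))
    return [(doc_id, count * idf) for doc_id, count in postings]


def pairwise_cosine(docs):
    """Return {(doc_a, doc_b): cosine similarity} for document pairs sharing a term."""
    n_docs = len(docs)
    emitted = [kv for doc_id, text in docs.items()
               for kv in map_term_counts(doc_id, text)]
    postings_by_term = shuffle(emitted)

    norms_sq = defaultdict(float)  # squared vector length per document
    dots = defaultdict(float)      # accumulated dot product per document pair
    for postings in postings_by_term.values():
        weighted = reduce_tfidf(postings, n_docs)
        for doc_id, w in weighted:
            norms_sq[doc_id] += w * w
        # Each term contributes one partial dot product per pair of documents using it.
        for (a, wa), (b, wb) in combinations(sorted(weighted), 2):
            dots[(a, b)] += wa * wb

    return {
        pair: dot / math.sqrt(norms_sq[pair[0]] * norms_sq[pair[1]])
        for pair, dot in dots.items()
        if norms_sq[pair[0]] > 0 and norms_sq[pair[1]] > 0
    }
```

Calling `pairwise_cosine({"d1": "map reduce text", "d2": "map reduce text", "d3": "cosine metric"})` yields an entry only for the pair `("d1", "d2")`, since `d3` shares no term with the others. Grouping by term first means only documents that share at least one term are ever compared, which is one common way to avoid enumerating all document pairs; whether CSMR uses this exact decomposition is not stated in the abstract.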