Concurrency and Computation: Practice and Experience

A scalable parallel algorithm for building web directories

Abstract

Web directories like Wikipedia and Open Directory Mozilla facilitate efficient information retrieval (IR) of web documents from a huge web corpus. Maintaining these directories is understandably difficult: it requires manual curation by human editors or semi-automated mechanisms. Research on parallel algorithms for the automated curation of such web directories would therefore benefit the IR domain. Hence, in this article, we propose a parallel algorithm for automatically creating web directories from a corpus of web documents. We use centrality-based techniques to split the corpus into fine-grained clusters, and subsequently an agglomeration based on locality-sensitive hashing to identify coarse-grained clusters in the web directory. Experimental results show that the algorithm generates meaningful hierarchies of the input corpus as measured by cluster-validity indices such as F-measure, Rand index, and cluster purity. The algorithm achieves a significant speedup and scales well with both the number of processors and the size of the input corpus.
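The abstract describes a two-stage pipeline (centrality-based fine-grained clustering followed by LSH-based agglomeration) without implementation detail. The Python sketch below is a minimal illustration of that idea under stated assumptions: documents are treated as token sets, "centrality" is approximated by degree centrality in a Jaccard-similarity graph, and the agglomeration uses a from-scratch MinHash/banding form of locality-sensitive hashing. All function names, thresholds, and parameters are hypothetical and not taken from the paper, and the sketch is sequential, whereas the paper's contribution is a parallel, scalable algorithm.

```python
# Illustrative sketch only: names and parameters are assumptions, not the paper's.
import random
from collections import defaultdict

NUM_HASHES = 64                  # MinHash signature length
BANDS = 16                       # LSH bands; signatures colliding in any band are merged
ROWS = NUM_HASHES // BANDS       # rows per band

random.seed(42)
_SALTS = [random.getrandbits(32) for _ in range(NUM_HASHES)]


def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    return len(a & b) / len(a | b) if a or b else 0.0


def fine_grained_clusters(docs, threshold=0.3):
    """Stage 1 (assumed degree-centrality variant): the unassigned document most
    similar to the remaining ones seeds a cluster; documents within `threshold`
    Jaccard similarity of the seed join it."""
    remaining = set(range(len(docs)))
    clusters = []
    while remaining:
        # degree centrality = summed similarity to the other remaining documents
        seed = max(remaining,
                   key=lambda i: sum(jaccard(docs[i], docs[j])
                                     for j in remaining if j != i))
        members = {i for i in remaining if jaccard(docs[seed], docs[i]) >= threshold}
        members.add(seed)
        clusters.append(sorted(members))
        remaining -= members
    return clusters


def minhash(tokens):
    """Crude salted MinHash signature approximating Jaccard similarity."""
    return [min((hash((salt, t)) & 0xFFFFFFFF) for t in tokens) for salt in _SALTS]


def lsh_agglomerate(docs, clusters):
    """Stage 2: agglomerate fine-grained clusters whose MinHash signatures
    collide in at least one LSH band, yielding coarse-grained clusters."""
    parent = list(range(len(clusters)))

    def find(x):                               # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    buckets = defaultdict(list)
    for ci, members in enumerate(clusters):
        tokens = set().union(*(docs[i] for i in members))
        sig = minhash(tokens)
        for b in range(BANDS):
            band = tuple(sig[b * ROWS:(b + 1) * ROWS])
            buckets[(b, band)].append(ci)
    for group in buckets.values():             # merge clusters sharing a bucket
        for ci in group[1:]:
            parent[find(ci)] = find(group[0])

    coarse = defaultdict(list)
    for ci, members in enumerate(clusters):
        coarse[find(ci)].extend(members)
    return list(coarse.values())


if __name__ == "__main__":
    corpus = [
        {"parallel", "algorithm", "cluster"},
        {"parallel", "algorithm", "speedup"},
        {"web", "directory", "wikipedia"},
        {"web", "directory", "retrieval"},
    ]
    fine = fine_grained_clusters(corpus)
    coarse = lsh_agglomerate(corpus, fine)
    print("fine-grained clusters:  ", fine)
    print("coarse-grained clusters:", coarse)
```

In a pipeline of this shape, the seed-selection similarity sums and the per-cluster MinHash computations are independent across documents and clusters, which is presumably where a parallel implementation would obtain its speedup.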
