Concurrency and Computation: Practice and Experience

A scalable parallel algorithm for building web directories

Abstract

Web directories like Wikipedia and Open Directory Mozilla facilitate efficient information retrieval (IR) of web documents from a huge web corpus. Maintaining these directories is understandably difficult: it requires manual curation by human editors or semi-automated mechanisms. Research on parallel algorithms for the automated curation of such web directories would therefore benefit the IR domain. Hence, in this article, we propose a parallel algorithm for automatically creating web directories from a corpus of web documents. We use centrality-based techniques to split the corpus into fine-grained clusters, and subsequently an agglomeration based on locality-sensitive hashing to identify coarse-grained clusters in the web directory. Experimental results show that the algorithm generates meaningful hierarchies of the input corpus as measured by cluster-validity indices such as F-measure, Rand index, and cluster purity. The algorithm achieves a significant speedup and scales well with both the number of processors and the size of the input corpus.
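The abstract describes a two-stage pipeline (centrality-based fine-grained clustering followed by LSH-based agglomeration) without implementation detail. The Python sketch below is a minimal illustration of that idea under stated assumptions: documents are treated as token sets, "centrality" is approximated by degree centrality in a Jaccard-similarity graph, and the agglomeration uses a from-scratch MinHash/banding form of locality-sensitive hashing. All function names, thresholds, and parameters are hypothetical and not taken from the paper, and the sketch is sequential, whereas the paper's contribution is a parallel, scalable algorithm.

```python
# Illustrative sketch only: names and parameters are assumptions, not the paper's.
import random
from collections import defaultdict

NUM_HASHES = 64                  # MinHash signature length
BANDS = 16                       # LSH bands; signatures colliding in any band are merged
ROWS = NUM_HASHES // BANDS       # rows per band

random.seed(42)
_SALTS = [random.getrandbits(32) for _ in range(NUM_HASHES)]


def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    return len(a & b) / len(a | b) if a or b else 0.0


def fine_grained_clusters(docs, threshold=0.3):
    """Stage 1 (assumed degree-centrality variant): the unassigned document most
    similar to the remaining ones seeds a cluster; documents within `threshold`
    Jaccard similarity of the seed join it."""
    remaining = set(range(len(docs)))
    clusters = []
    while remaining:
        # degree centrality = summed similarity to the other remaining documents
        seed = max(remaining,
                   key=lambda i: sum(jaccard(docs[i], docs[j])
                                     for j in remaining if j != i))
        members = {i for i in remaining if jaccard(docs[seed], docs[i]) >= threshold}
        members.add(seed)
        clusters.append(sorted(members))
        remaining -= members
    return clusters


def minhash(tokens):
    """Crude salted MinHash signature approximating Jaccard similarity."""
    return [min((hash((salt, t)) & 0xFFFFFFFF) for t in tokens) for salt in _SALTS]


def lsh_agglomerate(docs, clusters):
    """Stage 2: agglomerate fine-grained clusters whose MinHash signatures
    collide in at least one LSH band, yielding coarse-grained clusters."""
    parent = list(range(len(clusters)))

    def find(x):                               # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    buckets = defaultdict(list)
    for ci, members in enumerate(clusters):
        tokens = set().union(*(docs[i] for i in members))
        sig = minhash(tokens)
        for b in range(BANDS):
            band = tuple(sig[b * ROWS:(b + 1) * ROWS])
            buckets[(b, band)].append(ci)
    for group in buckets.values():             # merge clusters sharing a bucket
        for ci in group[1:]:
            parent[find(ci)] = find(group[0])

    coarse = defaultdict(list)
    for ci, members in enumerate(clusters):
        coarse[find(ci)].extend(members)
    return list(coarse.values())


if __name__ == "__main__":
    corpus = [
        {"parallel", "algorithm", "cluster"},
        {"parallel", "algorithm", "speedup"},
        {"web", "directory", "wikipedia"},
        {"web", "directory", "retrieval"},
    ]
    fine = fine_grained_clusters(corpus)
    coarse = lsh_agglomerate(corpus, fine)
    print("fine-grained clusters:  ", fine)
    print("coarse-grained clusters:", coarse)
```

In a pipeline of this shape, the seed-selection similarity sums and the per-cluster MinHash computations are independent across documents and clusters, which is presumably where a parallel implementation would obtain its speedup.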
