Conference: 2010 1st International Conference on Parallel Distributed and Grid Computing

A framework for hierarchical clustering based indexing in search engines

Abstract

Efficient, fast access to the index is key to the performance of Web Search Engines (WSEs). To improve memory utilization and speed up query resolution, WSEs use Inverted File (IF) indexes, which consist of an array of posting lists; each posting list is associated with a term and contains the term together with the identifiers of the documents that contain it. Because the document identifiers are stored in sorted order, each one can be stored as the difference from its predecessor, which reduces the size of the index. This paper describes a clustering algorithm that partitions the set of documents into ordered clusters so that documents within the same cluster are similar and are assigned nearby document identifiers. As a result, the average difference between successive document identifiers is minimized and storage space is saved. The paper then extends this algorithm to hierarchical clustering, in which similar clusters are merged into a mega cluster and similar mega clusters are in turn combined into a super cluster. These levels of clustering optimize the search process by directing a query along a specific path from the higher levels to the lower ones, that is, from super clusters to mega clusters, then to clusters, and finally to individual documents, so that the user gets the best possible matching results in the minimum possible time.
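
The storage argument in the abstract can be illustrated with a small sketch. It is not the paper's algorithm: the clusters are supplied by hand instead of being computed, the function names (`to_gaps`, `from_gaps`, `vbyte_len`, `reassign_ids`, `remap`) are hypothetical, and the 7-bit variable-byte code used to measure size is an assumption, since the abstract does not say how the differences are encoded.

```python
from itertools import accumulate
from typing import Dict, List


def to_gaps(doc_ids: List[int]) -> List[int]:
    """Turn a sorted posting list into its first ID plus successive differences."""
    if not doc_ids:
        return []
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]


def from_gaps(gaps: List[int]) -> List[int]:
    """Invert to_gaps by taking running sums."""
    return list(accumulate(gaps))


def vbyte_len(n: int) -> int:
    """Bytes needed for n under a 7-bit variable-byte code (assumed encoding)."""
    length = 1
    while n >= 128:
        n >>= 7
        length += 1
    return length


def reassign_ids(clusters: List[List[int]]) -> Dict[int, int]:
    """Give consecutive new IDs to documents, one cluster after another."""
    mapping: Dict[int, int] = {}
    next_id = 1
    for cluster in clusters:
        for old_id in cluster:
            mapping[old_id] = next_id
            next_id += 1
    return mapping


def remap(postings: List[int], mapping: Dict[int, int]) -> List[int]:
    """Rewrite a posting list under the new IDs and keep it sorted."""
    return sorted(mapping[d] for d in postings)


if __name__ == "__main__":
    # Hypothetical collection of 1000 documents; one term occurs in four of them.
    postings = [3, 250, 601, 900]
    similar = postings                      # hand-made cluster of similar documents
    rest = [d for d in range(1, 1001) if d not in set(similar)]
    mapping = reassign_ids([similar, rest])

    old_gaps = to_gaps(postings)            # [3, 247, 351, 299]
    new_gaps = to_gaps(remap(postings, mapping))  # [1, 1, 1, 1]
    print(sum(map(vbyte_len, old_gaps)))    # 7 bytes before reassignment
    print(sum(map(vbyte_len, new_gaps)))    # 4 bytes after reassignment
```

In this toy case the documents that share the term end up with adjacent identifiers, so every difference fits in one byte. Real collections would show a smaller, term-dependent gain, and the quality of the clustering determines how much.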