A framework for hierarchical clustering based indexing in search engines

Abstract

Efficient, fast access to the index is central to the performance of Web Search Engines (WSEs). To improve memory utilization and speed up query resolution, WSEs use Inverted File (IF) indexes, which consist of an array of posting lists. Each posting list is associated with one term and holds that term together with the identifiers of the documents that contain it. Because document identifiers are stored in sorted order, each can be stored as the difference from the preceding identifier, which reduces the size of the index. This paper describes a clustering algorithm that partitions the set of documents into ordered clusters so that documents within the same cluster are similar and are assigned nearby document identifiers. This reduces the average difference between successive identifiers in a posting list and therefore saves storage space. The paper then extends the algorithm to hierarchical clustering, in which similar clusters are merged into a mega cluster and similar mega clusters are in turn combined into a super cluster. These levels of clustering optimize search by directing a query along a specific path from the higher levels to the lower ones, that is, from super clusters to mega clusters, then to clusters, and finally to individual documents, so that the user gets the best-matching results in the least possible time.
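The storage argument in the abstract can be illustrated with a small sketch. It is not the paper's algorithm: the clustering itself is assumed to have been done already, the helper names (`to_gaps`, `from_gaps`, `reassign_ids`) are hypothetical, and the document names and identifiers are made-up illustrative data. The sketch only shows that numbering similar documents consecutively shrinks the gaps stored in a posting list.

```python
from itertools import accumulate
from typing import Dict, List


def to_gaps(doc_ids: List[int]) -> List[int]:
    """Turn a sorted posting list into its first ID followed by successive differences."""
    if not doc_ids:
        return []
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]


def from_gaps(gaps: List[int]) -> List[int]:
    """Recover the original identifiers with a running sum over the gaps."""
    return list(accumulate(gaps))


def reassign_ids(clusters: List[List[str]]) -> Dict[str, int]:
    """Number documents consecutively, one cluster after another (hypothetical helper)."""
    ordered_docs = [doc for cluster in clusters for doc in cluster]
    return {doc: new_id for new_id, doc in enumerate(ordered_docs, start=1)}


# Illustrative data, not taken from the paper.
old_ids = {"a": 3, "b": 250, "c": 1001, "d": 7, "e": 512, "f": 900}
clusters = [["a", "b", "c"], ["d", "e", "f"]]  # assumption: a, b, c are similar
term_docs = ["a", "b", "c"]  # documents that contain some term

new_ids = reassign_ids(clusters)
gaps_before = to_gaps(sorted(old_ids[d] for d in term_docs))  # [3, 247, 751]
gaps_after = to_gaps(sorted(new_ids[d] for d in term_docs))   # [1, 1, 1]

# Print each gap list with its average gap.
print(gaps_before, sum(gaps_before) / len(gaps_before))
print(gaps_after, sum(gaps_after) / len(gaps_after))
```

Smaller gaps matter because gap lists are typically stored with a variable-length integer code, where small values take fewer bits; which code the paper uses is not stated in the abstract.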

Bibliographic record

Source
2010 1st International Conference on Parallel Distributed and Grid Computing | 2010 | pp. 372-377 | 6 pages
Original format: PDF
CLC classification: Network computers (NC); Computer networks
Keywords
Document Identifiers Assignment; Hierarchical Clustering; Index compression; Inverted files;

Similar literature

1. A framework for utilising usage trends in the crawling and indexing process of search engines [J]. Neelam Duhan, A.K. Sharma. International Journal of Knowledge and Web Intelligence. 2011, No. 4

2. A Framework for Multilevel Indexing in Search Engines [J]. Parul Gupta, A. K. Gupta. International Journal of Applied Engineering Research. 2009, No. 8

3. A personalized search engine based on Web-snippet hierarchical clustering [J]. P. Ferragina, A. Gulli. Software. 2008, No. 2

4. A framework for hierarchical clustering based indexing in search engines [C]. {missing}. International Conference on Parallel Distributed and Grid Computing. 2010

5. Efficient indexing and query processing in distributed search engines [D]. Zhang, Jiangong. 2008

6. Feasibility of feature-based indexing clustering and search of clinical trials: A case study of breast cancer trials from ClinicalTrials.gov [O]. Mary Regina Boland, Riccardo Miotto, Junfeng Gao, -1

7. Personalized Tag based Image based Search Engines using clustering and similarity Indexing [O]. Tejwant Telang. 2018

8. NASA Indexing Benchmarks: Evaluating Test Search Engines [R]. Esler, S. L., Nelson, M. L. 2004

