Conference: 2011 Eighth Web Information Systems and Applications Conference

Clustering of Web Search Results Based on Combination of Links and In-Snippets

Abstract

Search engines are a common tool for retrieving information on the Web, but the quality of the returned results is still far from satisfactory. Users must scan a long result list to find the information they actually want. Much work has focused on post-processing search results to make them easier to examine, and clustering is one of the most common approaches. Term-based clustering was the first approach used to cluster results, but its quality suffers when the processed pages contain little text. Link-based clustering can overcome this problem, but the quality of its clusters depends heavily on the number of in-links and out-links that pages have in common. In this paper, we propose that the short text attached to an in-link is valuable information that helps achieve high clustering quality. To distinguish it from a general snippet, we call it an in-snippet. Based on the in-snippet, we propose a new clustering method that combines links and in-snippets. In our method, the similarity between pages consists of two parts: link similarity and term similarity. We designed a corresponding algorithm to implement the clustering. To prevent bias from human judgments, the experimental datasets are collected from the Open Directory Project (DMOZ). Because DMOZ is a human-edited directory, the datasets drawn from it have higher quality and larger scale. We use entropy and F-measure to evaluate the quality of the final clusters. Compared with the link-based and pure term-based algorithms, our method achieves better clustering quality.
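The two-part similarity described in the abstract can be illustrated with a minimal sketch. The abstract does not say how link similarity and term similarity are computed or combined, so everything here is an assumption made for illustration: cosine similarity for both parts, a linear mix controlled by a hypothetical weight `alpha`, whitespace tokenization of in-snippets, and the dictionary keys `in_links`, `out_links`, and `in_snippets`.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as Counters."""
    dot = sum(w * b[k] for k, w in a.items() if k in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector shares nothing with any other vector
    return dot / (norm_a * norm_b)


def link_vector(page):
    """Binary vector over a page's links; in-links and out-links are kept apart."""
    features = [("in", url) for url in set(page["in_links"])]
    features += [("out", url) for url in set(page["out_links"])]
    return Counter(features)


def term_vector(page):
    """Term counts over all in-snippets (text attached to in-links) of a page."""
    return Counter(
        word for snippet in page["in_snippets"] for word in snippet.lower().split()
    )


def combined_similarity(page_a, page_b, alpha=0.5):
    """Assumed linear mix: alpha * link similarity + (1 - alpha) * term similarity."""
    link_sim = cosine(link_vector(page_a), link_vector(page_b))
    term_sim = cosine(term_vector(page_a), term_vector(page_b))
    return alpha * link_sim + (1 - alpha) * term_sim


# Hypothetical pages, not taken from the paper's DMOZ datasets.
page_a = {"in_links": ["x", "y"], "out_links": ["z"], "in_snippets": ["python tutorial"]}
page_b = {"in_links": ["x"], "out_links": ["z"], "in_snippets": ["python guide"]}
page_c = {"in_links": ["q"], "out_links": [], "in_snippets": ["cooking recipes"]}

sim_ab = combined_similarity(page_a, page_b)
sim_ac = combined_similarity(page_a, page_c)
print(sim_ab > sim_ac)
```

Pairwise scores of this kind could then feed any standard clustering procedure. Setting `alpha` to 1 or 0 gives a purely link-based or purely term-based score, which corresponds to the kind of baselines the abstract compares against.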
