
Cluster Generation and Cluster Labelling for Web Snippets


Abstract

This paper describes Armil, a meta-search engine that groups into disjoint labelled clusters the Web snippets returned by auxiliary search engines. The cluster labels generated by Armil provide the user with a compact guide to assessing the relevance of each cluster to her information need. Striking the right balance between running time and cluster well-formedness was a key point in the design of our system. Both the clustering and the labelling tasks are performed on the fly by processing only the snippets provided by the auxiliary search engines, and use no external sources of knowledge. Clustering is performed by means of a fast version of the furthest-point-first algorithm for metric k-center clustering. Cluster labelling is achieved by combining intra-cluster and inter-cluster term extraction based on a variant of the information gain measure. We have tested the clustering effectiveness of Armil against Vivisimo, the de facto industrial standard in Web snippet clustering, using as benchmark a comprehensive set of snippets obtained from the Open Directory Project hierarchy. According to two widely accepted "external" metrics of clustering quality, Armil achieves better performance levels by 10%. We also report the results of a thorough user evaluation of both the clustering and the cluster labelling algorithms.
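The clustering step named in the abstract can be illustrated with a minimal sketch of the basic furthest-point-first heuristic for metric k-center clustering. This is the plain textbook form, not the faster version Armil uses, whose details the abstract does not give. Representing each snippet as a set of tokens, the Jaccard distance, and the names `jaccard_distance` and `furthest_point_first` are assumptions made for illustration only.

```python
from typing import Callable, List, Sequence, Set, Tuple


def jaccard_distance(a: Set[str], b: Set[str]) -> float:
    """Jaccard distance between token sets (an assumed metric, not taken from the paper)."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def furthest_point_first(
    points: Sequence[Set[str]],
    k: int,
    dist: Callable[[Set[str], Set[str]], float] = jaccard_distance,
) -> Tuple[List[int], List[int]]:
    """Basic furthest-point-first k-center heuristic.

    Returns (centers, assignment): `centers` holds the indices of the chosen
    center points, and `assignment[i]` is the position in `centers` of the
    center closest to point i, so the clusters are disjoint.
    """
    if not points or k <= 0:
        raise ValueError("need at least one point and k >= 1")
    k = min(k, len(points))

    centers = [0]  # arbitrary first center: the first point
    # Distance from every point to its closest center chosen so far.
    nearest = [dist(points[0], p) for p in points]
    assignment = [0] * len(points)

    while len(centers) < k:
        # The next center is the point furthest from all current centers.
        far = max(range(len(points)), key=nearest.__getitem__)
        centers.append(far)
        # Reassign points that are closer to the new center.
        for i, p in enumerate(points):
            d = dist(points[far], p)
            if d < nearest[i]:
                nearest[i] = d
                assignment[i] = len(centers) - 1
    return centers, assignment


# Hypothetical snippets, tokenised by whitespace.
snippets = [
    "jaguar car dealer price",
    "jaguar car engine review",
    "jaguar cat habitat rainforest",
    "jaguar cat prey rainforest",
]
centers, assignment = furthest_point_first([set(s.split()) for s in snippets], k=2)
print(centers, assignment)
```

Each round costs one distance computation per point, so the sketch uses on the order of n·k distance evaluations in total.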
