Published in: Asia Information Retrieval Societies Conference

An MDL-Based Frequent Itemset Hierarchical Clustering Technique to Improve Query Search Results of an Individual Search Engine


Abstract

In this research we propose a frequent itemset hierarchical clustering (FIHC) technique that uses an MDL-based algorithm, namely KRIMP. Unlike the original FIHC technique, the proposed method defines clustering as a rank sequence problem over the top-3 ranked list of each itemset-of-keywords cluster in the web document search results that a search engine returns for a given query. The key idea of an MDL compression-based approach is the code table. Only frequent and representative keywords, such as those in a KRIMP code table, can be used as candidates, instead of all important keywords from a keyword extractor such as RAKE. To simulate real-world information needs, the web documents are taken from the search results of a multi-domain query. Starting in a meta-search engine environment to gather many relevant documents, we set k = {50, 100, 200} for the k-toplist of retrieved documents of each search engine to build a dataset for automatic relevance judgement. We then apply the MDL-based FIHC algorithm to the best individual search engine, with k = {50, 100, 200} for the k-toplist of retrieved documents, minimum support = 5 for KRIMP itemset compression, and minimum cluster support = 0.1 for FIHC clustering. Our results show that MDL-based FIHC clustering can significantly improve the relevance scores of web search results on an individual search engine (by up to 39.2% at precision P@10, k-toplist = 50).
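To make the role of the minimum support threshold concrete, the following is a minimal Python sketch of how candidate keyword itemsets could be mined from documents before any MDL-based selection. It is not the authors' implementation and it is not KRIMP itself: KRIMP would go on to choose a code table from candidates like these by minimizing compressed size, a step omitted here. The function name `frequent_itemsets`, the `max_size` cap, and the assumption that each document is already reduced to a set of keywords are all hypothetical and introduced only for illustration.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(docs, min_support=5, max_size=2):
    """Return {itemset: support} for keyword itemsets found in at least
    `min_support` documents (support = absolute document count).

    `docs` is an iterable of keyword collections; `max_size` is a
    hypothetical cap on itemset size, not a parameter from the paper.
    """
    docs = [set(doc) for doc in docs]
    result = {}

    # Level 1: count how many documents contain each single keyword.
    counts = Counter(kw for doc in docs for kw in doc)
    frequent = {frozenset([kw]): c for kw, c in counts.items() if c >= min_support}
    result.update(frequent)

    size = 2
    while frequent and size <= max_size:
        # Apriori property: a larger frequent itemset can only contain
        # keywords that appear in frequent itemsets of the previous level.
        items = sorted(set().union(*frequent))
        counts = Counter()
        for doc in docs:
            present = [kw for kw in items if kw in doc]
            for combo in combinations(present, size):
                counts[frozenset(combo)] += 1
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        size += 1

    return result
```

With `min_support=5` as in the abstract, a keyword set is kept only if at least five retrieved documents contain all of its keywords. The level-wise loop is a plain Apriori-style enumeration chosen for readability, not for efficiency on large result lists.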
