Conference proceedings: Information Retrieval Technology
On the Construction of a Large Scale Chinese Web Test Collection

Abstract

The lack of a large scale Chinese test collection is an obstacle to the development of Chinese information retrieval. To address this issue, we built such a collection, composed of millions of Chinese web pages and known as the Chinese Web Test collection with 100 gigabytes of data (CWT100g). It is the largest Chinese web test collection as of this writing; it has been used by several dozen research groups and was adopted in the evaluation of the SEWM-2004 Chinese Web Track [1] and the HTRDPE-2004 [2]. We present the complete solution for constructing a large scale test collection such as the CWT100g. Further, we found that: 1) the distribution of the number of pages within sites obeys a Zipf-like law rather than the power law proposed by Adamic and Huberman [3, 4]; 2) an appropriate method of filtering host aliases saves about 25% of resources while crawling pages. The Zipf-like law and the host alias filtering method proposed in the paper should help both in modeling the Web and in improving search engines. Finally, we report on the results of the SEWM-2004 Chinese Web Track.
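As a rough illustration of what a Zipf-like law for pages per site means, the sketch below ranks sites by page count and estimates the exponent alpha in size ≈ C / rank^alpha with a least-squares fit on log-log axes. The abstract does not say how the authors fitted their data, so the function name `zipf_exponent`, the fitting procedure, and the toy counts are all assumptions made for illustration only.

```python
import math

def zipf_exponent(pages_per_site):
    """Estimate alpha in size ~ C / rank**alpha (hypothetical helper).

    Fits a straight line to log(size) versus log(rank) by ordinary
    least squares and returns the negated slope.
    """
    # Rank sites from largest to smallest, ignoring empty sites.
    sizes = sorted((n for n in pages_per_site if n > 0), reverse=True)
    if len(sizes) < 2:
        raise ValueError("need at least two non-empty sites")
    xs = [math.log(rank) for rank in range(1, len(sizes) + 1)]
    ys = [math.log(n) for n in sizes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx

# Synthetic, made-up counts that follow size = 1000 / rank exactly;
# these are not data from CWT100g.
toy_counts = [1000 / r for r in range(1, 51)]
alpha = zipf_exponent(toy_counts)
```

On real crawl data the fit is usually noisy at both ends of the ranking, so a plain least-squares slope like this is best read as a quick diagnostic rather than a rigorous estimate.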