On the Construction of a Large Scale Chinese Web Test Collection

Abstract

The lack of a large-scale Chinese test collection is an obstacle to the development of Chinese information retrieval. To address this issue, we built such a collection from millions of Chinese web pages: the Chinese Web Test collection with 100 gigabytes of data (CWT100g). It is the largest Chinese web test collection as of this writing, has been used by several dozen research groups, and was adopted in the evaluations of the SEWM-2004 Chinese Web Track [1] and the HTRDPE-2004 [2]. We present a complete solution for constructing a large-scale test collection like the CWT100g. Further, we found that: 1) the distribution of the number of pages within sites obeys a Zipf-like law rather than the power law proposed by Adamic and Huberman [3, 4]; and 2) an appropriate method of filtering host aliases saves about 25% of resources while crawling pages. The Zipf-like law and the host alias filtering method proposed in the paper will help both in modeling the Web and in improving search engines. Finally, we report on the results of the SEWM-2004 Chinese Web Track.

Bibliographic information

Source: Information Retrieval Technology, 2008, pp. 117-128 (12 pages)
Authors: Hongfei Yan; Chong Chen; Bo Peng; Xiaoming Li
Original format: PDF
Chinese Library Classification: Computer equipment security
Keywords: test collection; documents; Zipf-like law

Similar literature

1. The Impact of Organisational Resilience on Construction Project Success: Evidence from Large-Scale Construction in China [J]. Yang Jie, Cheng Qian. Journal of Civil Engineering and Management. 2020, No. 8
2. Toward More Comprehensive Chinese Internet Users' Studies: Translation and Validation of the Chinese-Mandarin Version of the 8-Item Information Retrieval on the Web Self-Efficacy Scale (Ch-IROWSE) [J]. Rodon Carole, Chevalier Aline. International Journal of Human-Computer Interaction. 2017, No. 10a12
3. China Starts Large Scale Railways Construction/China Customs Tariff Revenue Exceeds RMB460 Billion in the First Three Seasons [J]. China's Foreign Trade (English edition). 2006, No. 020
4. On the Construction of a Large Scale Chinese Web Test Collection [C]. Hongfei Yan, Chong Chen, Bo Peng. Proceedings of the 4th Asia Information Retrieval Symposium (AIRS 2008). 2008
5. A Web-based interactive instructional system for architectural historical-construction education: Using the traditional Chinese construction system as a case study [D]. Huang, Yan. 2001
6. Semi-Automatic Construction of the Chinese-English MeSH Using Web-Based Term Translation Method [O]. Wen-Hsiang Lu, Shih-Jui Lin, Yi-Che Chan. 2005
7. On the Construction of a Large Scale Chinese Web Test Collection [O]. Hongfei Yan, Chong Chen, Bo Peng. 2016
