What's Changed? Measuring Document Change in Web Crawling for Search Engines

Abstract

To provide fast, scalable search facilities, web search engines store collections locally. The collections are gathered by crawling the Web. A problem with crawling is determining when to revisit resources because they have changed: stale documents contribute towards poor search results, while unnecessary refreshing is expensive. However, some changes, such as those in images, advertisements, and headers, are unlikely to affect query results. In this paper, we investigate measures for determining whether documents have changed and should be recrawled. We show that content-based measures are more effective than the traditional approach of using HTTP headers. Refreshing based on HTTP headers typically recrawls 16% of the collection each day, but users do not retrieve the majority of refreshed documents. In contrast, refreshing documents when more than twenty words change recrawls 22% of the collection but updates documents more effectively. We conclude that our simple measures are an effective component of a web crawling strategy.
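
The twenty-word rule mentioned in the abstract can be illustrated with a minimal Python sketch. The abstract does not say how changed words are counted, so the definition used here (word occurrences added plus word occurrences removed, compared as bags of words) is an assumption, as are the function names `word_counts`, `words_changed`, and `should_recrawl`. The inputs are assumed to be plain text already extracted from the page, with markup, images, and advertisements removed.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of ASCII letters or digits, compared case-insensitively.
WORD_RE = re.compile(r"[A-Za-z0-9]+")


def word_counts(text: str) -> Counter:
    """Return a bag-of-words count for the given plain text."""
    return Counter(w.lower() for w in WORD_RE.findall(text))


def words_changed(old_text: str, new_text: str) -> int:
    """Count word occurrences added plus removed between two versions.

    This counting rule is an illustrative assumption, not the measure
    defined in the paper. A single substituted word counts as 2
    (one removal plus one addition).
    """
    old, new = word_counts(old_text), word_counts(new_text)
    removed = old - new  # Counter subtraction keeps only positive counts
    added = new - old
    return sum(removed.values()) + sum(added.values())


def should_recrawl(old_text: str, new_text: str, threshold: int = 20) -> bool:
    """Flag a document for refresh when more than `threshold` words changed."""
    return words_changed(old_text, new_text) > threshold
```

Under these assumptions, a page whose text gains 21 new words is flagged for refresh, while one that gains 20 or fewer is not. A different counting rule (for example, counting a substitution as one change) would shift where the threshold falls.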

Bibliographic details

Source: String Processing and Information Retrieval, 2003, pp. 28-42 (15 pages)
Authors: Halil Ali; Hugh E. Williams
Original format: PDF
Classification (Chinese Library Classification): automation technology, computer technology
