Exploiting Multiple Features with MEMMs for Focused Web Crawling

Abstract

Focused web crawling traverses the Web to collect documents on a specific topic. This is not an easy task, since a focused crawler must identify the next most promising link to follow based on the topic and on the content and links of previously crawled pages. In this paper, we present a framework based on Maximum Entropy Markov Models (MEMMs) for an enhanced focused web crawler that takes advantage of richer representations of multiple features extracted from Web pages, such as anchor text and the keywords embedded in the link URL, to represent useful context. The key idea of our approach is to treat focused web crawling as a sequential task and to use a combination of content analysis and link structure to capture sequential patterns leading to targets. The experimental results showed that focused crawling using MEMMs is, in general, very competitive with Best-First crawling on Web data in terms of two metrics: Precision and Maximum Average Similarity.
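
The abstract describes scoring candidate links with a maximum-entropy model conditioned on the previous state and on features such as anchor text and URL keywords. The following is a minimal sketch of that general idea, not the authors' implementation. The state names (`T0`, `T1`, `T2`, read here as an estimated link distance to a target page), the feature names, and the hand-set weights are all hypothetical; a real MEMM would learn its weights from training data.

```python
import heapq
import math
from typing import Dict, Iterable, List, Tuple

# Hypothetical states: estimated number of links from a page to a target page.
STATES: List[str] = ["T0", "T1", "T2"]

# Hypothetical hand-set weights keyed by (feature, state).
# A trained MEMM would estimate these; they are fixed here for illustration.
WEIGHTS: Dict[Tuple[str, str], float] = {
    ("anchor_on_topic", "T0"): 2.0,
    ("url_keyword", "T0"): 1.0,
    ("url_keyword", "T1"): 0.5,
    ("prev=T1", "T0"): 1.5,
    ("prev=T2", "T1"): 1.5,
}


def transition_probs(prev_state: str, features: Iterable[str]) -> Dict[str, float]:
    """Maximum-entropy distribution P(state | prev_state, features).

    Each state's score is the sum of the weights of its active features;
    the scores are normalised with a softmax.
    """
    active = set(features) | {f"prev={prev_state}"}
    scores = {s: sum(WEIGHTS.get((f, s), 0.0) for f in active) for s in STATES}
    z = sum(math.exp(v) for v in scores.values())
    return {s: math.exp(v) / z for s, v in scores.items()}


def link_priority(prev_state: str, features: Iterable[str]) -> float:
    """Use the probability of reaching a target state (T0) as the link score."""
    return transition_probs(prev_state, features)["T0"]


def rank_links(prev_state: str, links: Dict[str, List[str]]) -> List[str]:
    """Order candidate URLs from most to least promising via a priority queue."""
    heap = [(-link_priority(prev_state, feats), url) for url, feats in links.items()]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]


if __name__ == "__main__":
    candidates = {
        "http://example.org/a": [],
        "http://example.org/b": ["anchor_on_topic", "url_keyword"],
    }
    print(rank_links("T1", candidates))
```

In this sketch, the sequential aspect enters only through the `prev=` feature; a Best-First crawler would instead rank links by a content similarity score alone.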

Bibliographic Information

Source
International Conference on Applications of Natural Language to Information Systems, 2008, 12 pages
Authors
Hongyu Liu; Evangelos Milios; Larry Korba
Original format: PDF
Chinese Library Classification: TP3-53
Keywords
Focused Crawling; Web Search; Feature Selection; MEMMs

Similar Literature

1. Keyword weight optimization using gradient strategies in event focused web crawling [J]. Rajiv S., Navaneethan C. Pattern Recognition Letters. 2021, No. Feba
2. Focused Web Crawling for High Performance Search Engines: Issues, Techniques and Systems [J]. Sushil Kumar, Naresh Chauhan. International Journal of Computational Intelligence Theory and Practice. 2020, No. 1
3. Focused crawling for the hidden web [J]. F. Can. Computing Reviews. 2017, No. 1
4. Exploiting Multiple Features with MEMMs for Focused Web Crawling [C]. Hongyu Liu, Evangelos Milios, Larry Korba. Natural Language Processing and Information Systems. 2008
5. Connecting link structure and content on the Web for effective focused crawling [D]. Nickerson, Adam Stuart. 2003
6. Domain adaptation of statistical machine translation with domain-focused web crawling [O]. Pavel Pecina, Antonio Toral, Vassilis Papavassiliou
7. Exploiting Multiple Features with MEMMs for Focused Web Crawling [O]. Hongyu Liu, Evangelos Milios, Larry Korba. 2010
8. Focused Crawling of the Deep Web Using Service Class Descriptions [R]. Rocco, D., Liu, L., Critchlow, T. 2005