Computational Intelligence

PROBABILISTIC MODELS FOR FOCUSED WEB CRAWLING



Abstract

A focused crawler is an efficient tool used to traverse the Web to gather documents on a specific topic. It can be used to build domain-specific Web search portals and online personalized search tools. Focused crawlers can only use information obtained from previously crawled pages to estimate the relevance of a newly seen URL. Therefore, good performance depends on powerful modeling of context as well as the quality of the current observations. To address this challenge, we propose capturing sequential patterns along paths leading to targets based on probabilistic models. We model the process of crawling by a walk along an underlying chain of hidden states, defined by hop distance from target pages, from which the actual topics of the documents are observed. When a new document is seen, prediction amounts to estimating the distance of this document from a target. Within this framework, we propose two probabilistic models for focused crawling, Maximum Entropy Markov Model (MEMM) and Linear-chain Conditional Random Field (CRF). With MEMM, we exploit multiple overlapping features, such as anchor text, to represent useful context and form a chain of local classifier models. With CRF, a form of undirected graphical models, we focus on obtaining global optimal solutions along the sequences by taking advantage not only of text content, but also of linkage relations. We conclude with an experimental validation and comparison with focused crawling based on Best-First Search (BFS), Hidden Markov Model (HMM), and Context-graph Search (CGS).
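The abstract's core idea can be illustrated with a toy sketch (not the authors' implementation): hidden states are hop distances to a target page (0 = target), and decoding the most likely state sequence for a crawl path amounts to Viterbi inference over a linear chain, which is also how linear-chain CRF decoding works. All scores and names below are made-up assumptions for demonstration; in the paper, emission scores would come from learned MEMM/CRF features such as anchor text.

```python
import math

def viterbi(obs_scores, trans, init):
    """Decode the highest-scoring state sequence on a linear chain.
    obs_scores: per-document dicts state -> log emission score;
    trans: dict (prev_state, state) -> log transition score;
    init: dict state -> log initial score."""
    states = list(init)
    V = [{s: init[s] + obs_scores[0][s] for s in states}]
    back = []
    for scores in obs_scores[1:]:
        col, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda p: V[-1][p] + trans[(p, s)])
            col[s] = V[-1][prev] + trans[(prev, s)] + scores[s]
            ptr[s] = prev
        V.append(col)
        back.append(ptr)
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

# Illustrative (assumed) scores: a crawl tends to move one hop closer
# to a target per step, and each document's text weakly signals its state.
states = [0, 1, 2]
init = {0: math.log(0.1), 1: math.log(0.2), 2: math.log(0.7)}
trans = {(p, s): math.log(0.6 if s == p - 1 else 0.3 if s == p else 0.1)
         for p in states for s in states}
obs = [{2: math.log(0.8), 1: math.log(0.1), 0: math.log(0.1)},
       {1: math.log(0.8), 2: math.log(0.1), 0: math.log(0.1)},
       {0: math.log(0.8), 1: math.log(0.1), 2: math.log(0.1)}]

path = viterbi(obs, trans, init)
print(path)  # decoded hop distances fall 2 -> 1 -> 0
```

In a crawler, the last decoded state would set the frontier priority of the newest URL: an estimated distance of 0 or 1 means the page is worth expanding first.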
