Conference: International Conference on Applications of Natural Language to Information Systems

Exploiting Multiple Features with MEMMs for Focused Web Crawling

Abstract

Focused web crawling traverses the Web to collect documents on a specific topic. This is not an easy task, since a focused crawler must identify the next most promising link to follow based on the topic and on the content and links of previously crawled pages. In this paper, we present a framework based on Maximum Entropy Markov Models (MEMMs) for an enhanced focused web crawler that takes advantage of richer representations of multiple features extracted from Web pages, such as anchor text and the keywords embedded in the link URL, to represent useful context. The key idea of our approach is to treat focused web crawling as a sequential task and to use a combination of content analysis and link structure to capture sequential patterns leading to targets. The experimental results show that focused crawling using MEMMs is, in general, very competitive with Best-First crawling on Web data in terms of two metrics: Precision and Maximum Average Similarity.
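
To make the idea of scoring links from multiple features more concrete, the following is a minimal Python sketch and not the paper's implementation. It assumes two hypothetical states (`near_target`, `far_from_target`), hand-set illustrative weights, and simple binary features drawn from anchor text and URL keywords. The function names `link_features`, `memm_distribution`, and `rank_links` are invented for this illustration. The conditional distribution P(state | previous state, observation) is computed as a softmax over weighted features, which is the general maximum-entropy form, and links are then ordered by the probability of the `near_target` state.

```python
import heapq
import math
import re


def link_features(anchor_text, url, topic_terms):
    """Binary features from anchor text and URL keywords (illustrative only)."""
    anchor_tokens = set(re.findall(r"[a-z0-9]+", anchor_text.lower()))
    url_tokens = set(re.findall(r"[a-z0-9]+", url.lower()))
    feats = {}
    for term in topic_terms:
        if term in anchor_tokens:
            feats["anchor=" + term] = 1.0
        if term in url_tokens:
            feats["url=" + term] = 1.0
    return feats


def memm_distribution(prev_state, feats, weights, states):
    """P(state | prev_state, observation) as a softmax over weighted features."""
    scores = []
    for state in states:
        # Weights are keyed by (previous state, candidate state, feature name).
        total = weights.get((prev_state, state, "bias"), 0.0)
        for name, value in feats.items():
            total += weights.get((prev_state, state, name), 0.0) * value
        scores.append(total)
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(score - top) for score in scores]
    norm = sum(exps)
    return {state: e / norm for state, e in zip(states, exps)}


def rank_links(links, prev_state, weights, topic_terms):
    """Order (anchor_text, url) pairs by P(near_target), highest first."""
    states = ["near_target", "far_from_target"]  # hypothetical state set
    heap = []
    for order, (anchor_text, url) in enumerate(links):
        feats = link_features(anchor_text, url, topic_terms)
        prob = memm_distribution(prev_state, feats, weights, states)["near_target"]
        # heapq is a min-heap, so negate the probability; `order` breaks ties.
        heapq.heappush(heap, (-prob, order, url))
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]
```

In a real crawler the weights would be learned from training paths rather than set by hand, and the ranked list would feed the crawl frontier. The sketch only illustrates how anchor-text and URL features can be combined into a single conditional probability that depends on the previous state.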