Learning page-independent heuristics for extracting data from Web pages

Abstract

One bottleneck in implementing a system that intelligently queries the Web is developing 'wrappers' -- programs that extract data from Web pages. Here we describe a method for learning general, page-independent heuristics for extracting data from HTML documents. The input to our learning system is a set of working wrapper programs, paired with HTML pages they correctly wrap. The output is a general procedure for extracting data that works for many formats and many pages. In experiments with a collection of 84 constrained but realistic extraction problems, we demonstrate that 30% of the problems can be handled perfectly by learned extraction heuristics, and around 50% can be handled acceptably. We also demonstrate that learned page-independent extraction heuristics can substantially improve the performance of methods for learning page-specific wrappers.
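The input/output contract described in the abstract (working wrappers paired with pages in, one general extraction procedure out) can be illustrated with a toy sketch. The abstract does not say which features or learning algorithm the authors use, so everything below is an assumption made for illustration: the single feature (the enclosing tag of a text node), the majority-vote "learner", and the names `text_nodes`, `learn_heuristic`, `wrapper_a`, and `wrapper_b` are all hypothetical and are not the paper's method.

```python
import re
from collections import Counter
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}  # never closed


class TextNodes(HTMLParser):
    """Collect (enclosing_tag, text) for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating sloppy HTML.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append((self.stack[-1] if self.stack else "", text))


def text_nodes(html):
    parser = TextNodes()
    parser.feed(html)
    parser.close()
    return parser.nodes


def learn_heuristic(training_pairs):
    """training_pairs: (wrapper, html) pairs; wrapper(html) returns extracted strings.

    Toy learner (an assumption, not the paper's): keep every tag whose text
    nodes were extracted by the wrappers more than half of the time.
    """
    extracted, seen = Counter(), Counter()
    for wrapper, html in training_pairs:
        wanted = set(wrapper(html))
        for tag, text in text_nodes(html):
            seen[tag] += 1
            extracted[tag] += text in wanted
    good_tags = {tag for tag in seen if extracted[tag] / seen[tag] > 0.5}

    def extract(html):
        # Page-independent: uses only the learned tag set, no page-specific pattern.
        return [text for tag, text in text_nodes(html) if tag in good_tags]

    return extract


# Two page-specific wrappers (hypothetical), each valid only for its own page.
page_a = "<html><body><h1>Shops</h1><ul><li>Alpha</li><li>Beta</li></ul></body></html>"
page_b = '<html><body><p>Intro</p><ol><li class="item">Gamma</li></ol></body></html>'


def wrapper_a(html):
    return re.findall(r"<li>(.*?)</li>", html)


def wrapper_b(html):
    return re.findall(r'<li class="item">(.*?)</li>', html)


heuristic = learn_heuristic([(wrapper_a, page_a), (wrapper_b, page_b)])

# Apply the learned heuristic to a page that has no wrapper of its own.
unseen = "<html><body><h1>New</h1><ul><li>Delta</li><li>Epsilon</li></ul><p>Footer</p></body></html>"
print(heuristic(unseen))
```

The point of the sketch is only the data flow: each (wrapper, page) pair labels the page's text nodes as extracted or not, and the learned procedure is then applied to a page for which no wrapper exists.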

Bibliographic details

Source: Proceedings of the Eighth International World Wide Web Conference, 1999, pp. 563-574 (12 pages)
Conference location: Toronto (CA)
Authors: William W. Cohen; Wei Fan
Original format: PDF
Language: English
Chinese Library Classification: Automation technology, computer technology
Keywords: Information integration; Machine learning; Extraction
