Learning page-independent heuristics for extracting data from Web pages


Abstract

One bottleneck in implementing a system that intelligently queries the Web is developing "wrappers", programs that extract data from Web pages. Here we describe a method for learning general, page-independent heuristics for extracting data from HTML documents. The input to our learning system is a set of working wrapper programs, paired with HTML pages they correctly wrap. The output is a general procedure for extracting data that works for many formats and many pages. In experiments with a collection of 84 constrained but realistic extraction problems, we demonstrate that 30% of the problems can be handled perfectly by learned extraction heuristics, and around 50% can be handled acceptably. We also demonstrate that learned page-independent extraction heuristics can substantially improve the performance of methods for learning page-specific wrappers.
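The abstract states only the input and output of the learning system, so the following is a rough illustration of how (wrapper, page) pairs could be turned into labeled training examples. It is not the authors' algorithm. The feature set, the helper names (`text_nodes`, `features`, `training_examples`), and the toy wrapper `list_item_wrapper` are all assumptions made for this sketch.

```python
from html.parser import HTMLParser


class NodeCollector(HTMLParser):
    """Collect (tag path, text) pairs for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.nodes = []   # list of (path tuple, stripped text)

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append((tuple(self.stack), text))


def text_nodes(html):
    collector = NodeCollector()
    collector.feed(html)
    collector.close()
    return collector.nodes


def features(path, all_paths):
    # Hypothetical page-independent features: they describe the layout
    # role of a node, not the content of any particular page.
    return {
        "tag": path[-1] if path else "",
        "depth": len(path),
        "same_path_count": sum(1 for p in all_paths if p == path),
    }


def training_examples(wrapper, html):
    """Label each text node by whether the working wrapper extracts it."""
    nodes = text_nodes(html)
    extracted = set(wrapper(html))
    paths = [path for path, _ in nodes]
    return [(features(path, paths), text in extracted) for path, text in nodes]


def list_item_wrapper(html):
    # Toy page-specific wrapper (an assumption): extract every <li> text.
    return [text for path, text in text_nodes(html) if path and path[-1] == "li"]


PAGE = "<html><body><h1>Staff</h1><ul><li>Ann</li><li>Bob</li></ul></body></html>"
examples = training_examples(list_item_wrapper, PAGE)
```

Examples pooled from many such pairs could then be passed to any off-the-shelf classifier. Because the features describe layout and not page content, the resulting classifier could in principle be applied to pages it has never seen.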


Bibliographic details

Source: International World Wide Web Conference, 1999, 12 pages
Authors: William W. Cohen; Wei Fan
Original format: PDF
Chinese Library Classification: Computer applications
Keywords: information integration; machine learning; extraction


Similar literature

1. Learning page-independent heuristics for extracting data from Web pages [J]. William W. Cohen, Wei Fan. Computer Networks. 1999, no. 11-16
2. Learning page-independent heuristics for extracting data from Web pages [J]. William W. Cohen, Wei Fan. Computer Networks. 1999, no. 11-16
3. Extracting Web Data Using Instance-Based Learning [J]. Yanhong Zhai, Bing Liu. World Wide Web. 2007, no. 2
4. Learning page-independent heuristics for extracting data from Web pages [C]. William W. Cohen, Wei Fan. International World Wide Web Conference. 1999
5. Extracting and managing structured web data [D]. Cafarella, Michael John. 2009
6. A machine learning heuristic to identify biologically relevant and minimal biomarker panels from omics data [O]. Anna L Swan, Dov J Stekel, Charlie Hodgman. 2015
7. Learning page-independent heuristics for extracting data from web pages [O]. William Cohen, Wei Fan. 1999
8. Learning to Extract Symbolic Knowledge from the World Wide Web [R]. Craven, M., McCallum, A., DiPasquo, D. 1998
