EGA: An Algorithm for Automatic Semi-structured Web Documents Extraction

Abstract

With the fast expansion of the World Wide Web, more and more semi-structured web documents appear on the web. In this paper, we study how to extract information from semi-structured web documents using automatically generated wrappers. To automate wrapper generation and the data extraction process, we develop a novel algorithm, EGA (EPattern Generation Algorithm), which constructs extraction patterns based on the local structural context features of the web documents. These optimal or near-optimal extraction patterns are described in the XPath language. Experimental results on RISE and our own data sets confirm the feasibility of our approach.
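To make the idea of an extraction pattern "described in the XPath language" concrete, here is a minimal sketch of applying one such pattern to a page. It does not reproduce EGA or how EGA generates patterns. The page fragment, the pattern strings, and the `extract` helper are all hypothetical, and Python's `xml.etree.ElementTree` supports only a subset of XPath and requires well-formed markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment (not taken from RISE or the paper).
PAGE = """
<html><body>
  <table>
    <tr><td class="title">Book A</td><td class="price">10.00</td></tr>
    <tr><td class="title">Book B</td><td class="price">12.50</td></tr>
  </table>
</body></html>
"""

# Hypothetical extraction patterns: each locates a field through its local
# structural context (the enclosing row and the cell's class attribute).
TITLE_PATTERN = ".//tr/td[@class='title']"
PRICE_PATTERN = ".//tr/td[@class='price']"


def extract(page, pattern):
    """Return the stripped text of every node matched by the pattern."""
    root = ET.fromstring(page.strip())
    return [(node.text or "").strip() for node in root.findall(pattern)]


titles = extract(PAGE, TITLE_PATTERN)
prices = extract(PAGE, PRICE_PATTERN)
print(list(zip(titles, prices)))
```

A wrapper in this sense is just a set of such patterns, one per field. Real web pages are rarely well-formed XML, so a practical system would need a tolerant HTML parser in front of this step.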

Bibliographic details

Source
International Conference on Database Systems for Advanced Applications (DASFAA 2004); 17-19 March 2004; Jeju Island, KR | 2004 | pp. 787-798 | 12 pages
Conference location: Jeju Island (KR)
Authors
Liyu Li; Shiwei Tang; Dongqing Yang; Tengjiao Wang; Zhihua Su;
Author affiliation

National Laboratory On Machine Perception, Peking University, Beijing, China;

Original format: PDF
Language of text: English
Chinese Library Classification: various special-purpose databases
Keywords
information extraction; genetic algorithm; machine learning; semi-structured document; XPath;

Similar literature

1. Automatic Extraction of Objects and their Attributes from Semi-Structured Web Tables for E-commerce Tasks [J]. Yerzhan Baiburin, Aliya Nugumanova. Indian Journal of Science and Technology, 2015, No. 30.
2. Automatic information extraction from semi-structured Web pages by pattern discovery [J]. Chia-Hui Chang, Chun-Nan Hsu, Shao-Cheng Lui. Decision Support Systems, 2003, No. 1.
3. Automatic ontology-based knowledge extraction from Web documents [J]. Alani H., Sanghee Kim, Millard D.E. IEEE Intelligent Systems & Their Applications, 2003, No. 1.
4. EGA: An Algorithm for Automatic Semi-structured Web Documents Extraction [C]. Liyu Li, Shiwei Tang, Dongqing Yang. International Conference on Database Systems for Advanced Applications, 2004.
5. Automatic term extraction and document similarity in special text corpora [D]. Dong, Li. 2002.
6. Retracted: An Automatic Web Service Composition Framework Using QoS-Based Web Service Ranking Algorithm [O]. The Scientific World Journal, 2016.
7. Automatic construction and adaptation of wrappers for semi-structured web documents [O]. 2003.
