Extraction of relevant components using shallow structure of HTML documents



Abstract

As the number of web pages increases, searching semi-structured documents is gaining greater attention. The traditional approach to extracting data from web pages is to write specialized programs, called wrappers, that identify data of interest and map them to a suitable format. However, developing wrappers manually has many well-known shortcomings, mainly because they are difficult to write and to maintain for continually changing web data. Moreover, no single wrapper program can handle all kinds of web pages. In this paper, we aim to extract relevant and meaningful snippets from as many web pages as possible, using shallow features of HTML documents to discover and analyze the relevant components. We also introduce a new feature called GAP and verify its effectiveness in an SVM learning experiment.
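
The idea of describing page components by shallow features, rather than by a hand-written wrapper, can be illustrated with a minimal sketch. The abstract does not list the paper's feature set or define GAP, so everything below is an assumption made for illustration: the choice of block-level tags, the three features (text length, link density, nesting depth), and the names `ShallowFeatureParser` and `shallow_features` are hypothetical, and GAP is not implemented. Feature vectors of this kind could in principle be passed to an SVM classifier, which is not shown here.

```python
from html.parser import HTMLParser

# Block-level tags treated as candidate components (an illustrative choice,
# not taken from the paper).
BLOCK_TAGS = {"div", "p", "td", "li", "section", "article"}


class ShallowFeatureParser(HTMLParser):
    """Collect simple per-block features without building a full DOM tree."""

    def __init__(self):
        super().__init__()
        self.stack = []       # currently open block-level components
        self.blocks = []      # finished components, in closing order
        self.link_depth = 0   # > 0 while inside an <a> element

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.stack.append({"tag": tag, "depth": len(self.stack),
                               "text_len": 0, "link_len": 0})
        elif tag == "a":
            self.link_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.link_depth > 0:
            self.link_depth -= 1
        elif tag in BLOCK_TAGS and self.stack and self.stack[-1]["tag"] == tag:
            block = self.stack.pop()
            # Share of the block's own text that sits inside links.
            block["link_density"] = (
                block["link_len"] / block["text_len"] if block["text_len"] else 0.0
            )
            self.blocks.append(block)

    def handle_data(self, data):
        n = len(data.strip())
        if n and self.stack:
            # Text is attributed to the innermost open block only.
            self.stack[-1]["text_len"] += n
            if self.link_depth:
                self.stack[-1]["link_len"] += n


def shallow_features(html):
    """Return one feature dict per block-level component found in `html`."""
    parser = ShallowFeatureParser()
    parser.feed(html)
    parser.close()
    return parser.blocks


if __name__ == "__main__":
    page = '<div><p>Hello world</p><p><a href="#">menu</a></p></div>'
    for block in shallow_features(page):
        print(block)
```

In this sketch a block whose text is mostly link text (high link density) looks like navigation, while a block with long, link-free text looks like content; a learned classifier would weigh such signals instead of relying on fixed rules.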


Bibliographic details

Source: Fuzzy Systems and Knowledge Discovery (FSKD), 2012 9th International Conference on | 2012 | pp. 1186-1190 | 5 pages
Author: Zeng Jun
Original format: PDF
Chinese Library Classification: artificial intelligence theory

Similar documents

1. Relevance-based content extraction of HTML documents [J]. WU Qi, CHEN Xing-shu, ZHU Kai. Journal of Central South University (English Edition), 2012, No. 7.
2. Employing Clustering Techniques for Automatic Information Extraction From HTML Documents [J]. Ashraf F., Özyer T., Alhajj R. IEEE Transactions on Systems, Man and Cybernetics, Part C: Applications and Reviews, 2008, No. 5.
3. Automatic Machine Learning of Keyphrase Extraction from Short HTML Documents Written in Hebrew [J]. Yaakov HaCohen-Kerner, Ittay Stern, David Korkus. Cybernetics and Systems, 2007, No. 1.
4. Extraction of Relevant Components Using Shallow Structure of HTML Documents [C]. Jun Zeng, Brendan Flanagan, Toshihiko Sakai. International Conference on Fuzzy Systems and Knowledge Discovery, 2012.
5. Context-based content extraction of HTML documents [D]. Gupta, Suhit. 2006.
6. Incorporating deep and shallow components of genetic structure into the management of Alaskan red king crab [O]. William Stewart Grant, Wei Cheng. 2012.
7. Title Extraction from Bodies of HTML Documents and its Application to Web Page Retrieval [O]. Yunhua Hu, Guomao Xin, Ruihua Song. 2005.
