Conference: Fuzzy Systems and Knowledge Discovery (FSKD), 2012 9th International Conference on

Extraction of relevant components using shallow structure of HTML documents

Abstract

As the number of web pages grows, searching semi-structured documents is attracting increasing attention. The traditional approach to extracting data from web pages is to write specialized programs, called wrappers, that identify the data of interest and map them to a suitable format. However, developing wrappers manually has many well-known shortcomings, mainly because they are difficult to write and to maintain for continually changing web data. Moreover, no single wrapper program can handle all kinds of web pages. In this paper, we aim to extract relevant and meaningful snippets from as many web pages as possible, using shallow features of HTML documents to discover and analyze the relevant components. We also introduce a new feature called GAP and verify its effectiveness in an SVM learning experiment.