Conference: International Conference on Data Warehousing and Knowledge Discovery

OWDEAH: Online Web Data Extraction Based on Access History


Abstract

Web data extraction systems are the kernel of information mediators between users and heterogeneous Web data resources. Extracting structured data from semi-structured documents has long been an active research problem. Supervised and unsupervised methods have been devised to learn extraction rules from training sets. However, preparing training sets (and especially annotating them for supervised methods) is very time-consuming. We propose a framework for Web data extraction that logs users' access history and exploits it to assist automatic training set generation. We cluster the accessed Web documents according to their structural details, define criteria to measure the importance of sub-structures, and then generate extraction rules. We also propose a method to adjust the rules according to historical data. Our experiments confirm the viability of our proposal.
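The clustering step mentioned in the abstract can be illustrated with a small sketch. The abstract does not say which structural features or similarity measure the framework uses, so everything below is an assumption made for illustration: each page is reduced to its set of root-to-node tag paths, pages are compared with Jaccard similarity, and a greedy single-pass scheme assigns each page to the first sufficiently similar cluster. The names `TagPathCollector`, `tag_paths`, `jaccard`, `cluster_by_structure`, the sample `pages`, and the `threshold` value are all hypothetical and do not come from the paper.

```python
from html.parser import HTMLParser

# Void elements never receive a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagPathCollector(HTMLParser):
    """Collects the set of root-to-node tag paths seen in one HTML document."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass


def tag_paths(html):
    """Return the structural signature of a page as a frozenset of tag paths."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return frozenset(collector.paths)


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_by_structure(pages, threshold=0.6):
    """Greedy clustering (an assumed scheme, not the paper's): a page joins
    the first cluster whose representative signature is similar enough."""
    clusters = []  # list of (representative signature, member page ids)
    for page_id, html in pages.items():
        paths = tag_paths(html)
        for representative, members in clusters:
            if jaccard(representative, paths) >= threshold:
                members.append(page_id)
                break
        else:
            clusters.append((paths, [page_id]))
    return [members for _, members in clusters]


# Hypothetical access log: page id -> HTML of a page the user visited.
pages = {
    "product_1": "<html><body><div><h1>A</h1><table><tr><td>1</td></tr>"
                 "</table></div></body></html>",
    "product_2": "<html><body><div><h1>B</h1><table><tr><td>2</td></tr>"
                 "<tr><td>3</td></tr></table></div></body></html>",
    "news_1": "<html><body><ul><li><a href='#'>x</a></li></ul><p>text</p>"
              "</body></html>",
}

print(cluster_by_structure(pages))
```

With these sample pages, the two product pages share the same tag paths and fall into one cluster, while the news page forms its own. Under this assumption, each cluster could then serve as an automatically generated training set for rule learning. How the framework actually scores sub-structures and derives rules is not specified in the abstract and is not modeled here.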
