Annual ACM Symposium on Applied Computing (SAC 2010)

WMS- Extracting Multiple Sections Data Records from Search Engine Results Pages



Abstract

In this paper, we develop an automatic wrapper for extracting multiple sections data records from search engine results pages. In the information extraction field, little attention has been paid to wrappers that extract multiple sections data records, as evidenced by the fact that only one automatic wrapper, MSE, has been developed for this purpose. MSE uses the separation distance between data records and sections to distinguish sections from data records and extract them from search engine results pages. In this study, our approach uses DOM tree properties to develop an adaptive search method that detects, differentiates, and partitions sections and data records. The labeled multiple sections data records are then passed through several filtering stages, each designed to filter out a particular group of irrelevant data, until one data region containing the relevant records is found. Our filtering rules are based on visual cues, such as text and image size, obtained from the browser rendering engine. Experimental results show that our wrapper obtains better results than the currently available MSE wrapper.
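The abstract does not give the algorithm in detail, so the following is only a minimal sketch of the general idea, not the WMS method itself. It assumes that a section shows up in the DOM tree as a run of adjacent siblings with the same structure, and it uses the amount of text per record as a rough stand-in for the visual cues (text and image size) that the paper obtains from a browser rendering engine. All names here (`Node`, `TreeBuilder`, `candidate_regions`, `extract_sections`, `min_avg_chars`) are hypothetical, and the parser assumes well-formed HTML.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """Minimal DOM node: tag name, parent, children, and directly contained text."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Builds a small DOM tree; assumes every non-void tag is properly closed."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.cur = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.cur)
        self.cur.children.append(node)
        if tag not in VOID_TAGS:
            self.cur = node

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.cur.parent is not None:
            self.cur = self.cur.parent

    def handle_data(self, data):
        self.cur.text += data.strip()


def subtree_text(node):
    """Concatenate the text of a node and all of its descendants."""
    parts = [node.text] + [subtree_text(c) for c in node.children]
    return " ".join(p for p in parts if p)


def signature(node):
    """Structural fingerprint: the node's tag plus the tags of its children."""
    return (node.tag, tuple(c.tag for c in node.children))


def candidate_regions(node, min_records=2):
    """Yield runs of adjacent siblings that share the same signature."""
    run = []
    for child in node.children:
        if run and signature(child) != signature(run[0]):
            if len(run) >= min_records:
                yield run
            run = []
        run.append(child)
    if len(run) >= min_records:
        yield run
    for child in node.children:
        yield from candidate_regions(child, min_records)


def extract_sections(html, min_avg_chars=20):
    """Return one list of record texts per region that survives the filter.

    The filter drops regions whose records are short on average (for example
    navigation links); text length is an assumed proxy for rendered size.
    """
    builder = TreeBuilder()
    builder.feed(html)
    sections = []
    for region in candidate_regions(builder.root):
        texts = [subtree_text(record) for record in region]
        if sum(len(t) for t in texts) / len(texts) >= min_avg_chars:
            sections.append(texts)
    return sections


SAMPLE = """
<div id="nav"><a>Home</a><a>Images</a><a>News</a></div>
<div class="web">
<div><a>First web result title</a><p>Snippet one text here.</p></div>
<div><a>Second web result title</a><p>Snippet two text here.</p></div>
</div>
<ul class="news">
<li>News item number one with a long headline</li>
<li>News item number two with a long headline</li>
</ul>
"""

if __name__ == "__main__":
    for i, records in enumerate(extract_sections(SAMPLE)):
        print("section", i, records)
```

On the hypothetical `SAMPLE` page, the navigation links form a repeating run but are dropped by the text-length filter, while the web results and the news list are kept as two separate sections. A real wrapper would replace the text-length test with measurements taken from a rendering engine, as the abstract describes.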
