Conference: Latin American Web Conference

A Rendering-Based Method for Selecting the Main Data Region in Web Pages



Abstract

Extracting data from web pages is an important task for several applications, such as comparison shopping and data mining. Much of that data is provided by search result pages, in which each result, called a search result record, represents a record from a database. One of the most important steps in extracting such records is identifying, among the different data regions of a page, the one that contains the records to be extracted. An incorrect identification of this region may lead to an incorrect extraction of the search result records. In this paper, we propose a simple but efficient method that generates a path expression to select the main data region of a given page, based on the rendering area information of its elements. The generated path expression may be used by wrappers to extract the search result records and their data units, reducing the wrappers' complexity and increasing their accuracy. Experimental results using web pages from several domains show that the method is highly effective.
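The abstract does not give the details of the method, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes each element's rendered width and height are already known (in practice they would come from a browser layout engine) and descends the tree while a single child covers most of its parent's rendered area, emitting an XPath-style path to the element where it stops. The `Node` class, the `main_region_path` function, and the `dominance` threshold are hypothetical names and parameters introduced here for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical DOM element with its rendered size in pixels."""
    tag: str
    width: float
    height: float
    children: List["Node"] = field(default_factory=list)

    @property
    def area(self) -> float:
        return self.width * self.height


def main_region_path(root: Node, dominance: float = 0.5) -> str:
    """Follow the largest child while it covers more than `dominance`
    of its parent's rendered area; return an XPath-style path."""
    steps = [root.tag]
    node = root
    while node.children:
        biggest = max(node.children, key=lambda c: c.area)
        # Stop when no single child dominates: the area is now spread
        # over several siblings, which are treated here as the records.
        if node.area == 0 or biggest.area / node.area <= dominance:
            break
        # 1-based position among same-tag siblings, as in XPath.
        same_tag = [c for c in node.children if c.tag == biggest.tag]
        index = next(i for i, c in enumerate(same_tag, 1) if c is biggest)
        steps.append(f"{biggest.tag}[{index}]")
        node = biggest
    return "/" + "/".join(steps)


# Toy page: header, content (sidebar + result list), footer.
results = Node("ul", 800, 1700, [Node("li", 800, 170) for _ in range(10)])
content = Node("div", 1000, 1700, [Node("div", 200, 1700), results])
body = Node("body", 1000, 2000,
            [Node("div", 1000, 100), content, Node("div", 1000, 200)])
page = Node("html", 1000, 2000, [body])

path = main_region_path(page)
print(path)  # /html/body[1]/div[2]/ul[1]
```

In this toy page the descent stops at the `ul` element because each `li` covers only a tenth of its area. A wrapper could then apply the resulting path to restrict record extraction to that region.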
