Journal: American Journal of Applied Sciences

A Layout Based Detachment Approach for Extracting Content from Webpages

Abstract

An enormous amount of useful information on the Internet is formatted for web users, which makes extracting the relevant data from diverse web sources a complex task. Various approaches for extracting data from webpages have recently been proposed. This study presents a simple but effective approach, the Layout Based Detachment Approach (LBDA). LBDA extracts the main content of a webpage by removing irrelevant information such as header and footer contents, navigation bars, advertisements and other noisy images. It combines three techniques: tag tree parsing to obtain the structure for analysis, a block-acquiring page segmentation method to remove unwanted tags, and data extraction to retrieve the necessary content. In this way LBDA eliminates noise, extracts the main content blocks from the webpage and displays the essential content to the user. Its performance is evaluated in terms of accuracy, precision, recall, execution time and memory usage. The implementation results show that LBDA performs better than the existing heuristic approach.
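
The three stages named in the abstract (tag tree parsing, block-level noise removal, data extraction) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give LBDA's segmentation rules, so the noise tag list (`NOISE_TAGS`), the class/id hints (`NOISE_HINTS`) and all function names below are assumptions chosen for illustration. A word-level `precision_recall` helper is included as one possible reading of the evaluation metrics.

```python
from html.parser import HTMLParser

# Assumed noise rules for illustration only; not taken from the paper.
NOISE_TAGS = {"header", "footer", "nav", "aside", "script", "style", "img"}
NOISE_HINTS = ("ad", "ads", "banner", "sidebar", "menu")
# Elements that never have a closing tag, so the builder must not descend.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """One element of the tag tree."""

    def __init__(self, tag, attrs, parent=None):
        self.tag = tag
        self.attrs = dict(attrs)
        self.parent = parent
        self.children = []  # child Nodes and text strings, in document order


class TreeBuilder(HTMLParser):
    """Stage 1: parse HTML into a tag tree."""

    def __init__(self):
        super().__init__()
        self.root = Node("document", [])
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray tags.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.children.append(data.strip())


def is_noise(node):
    """Stage 2: decide whether a block is noise (assumed heuristic)."""
    if node.tag in NOISE_TAGS:
        return True
    label = " ".join(v for v in (node.attrs.get("class"), node.attrs.get("id")) if v)
    tokens = label.lower().replace("-", " ").replace("_", " ").split()
    return any(hint in tokens for hint in NOISE_HINTS)


def extract_text(node):
    """Stage 3: collect text from every block that is not noise."""
    parts = []
    for child in node.children:
        if isinstance(child, str):
            parts.append(child)
        elif not is_noise(child):
            parts.extend(extract_text(child))
    return parts


def main_content(html):
    """Run the three stages and return the remaining text."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return " ".join(extract_text(builder.root))


def precision_recall(extracted, reference):
    """Word-level precision and recall against a hand-labelled reference."""
    got, want = set(extracted.split()), set(reference.split())
    hits = len(got & want)
    precision = hits / len(got) if got else 0.0
    recall = hits / len(want) if want else 0.0
    return precision, recall
```

Calling `main_content` on a page whose `<div id="main">` sits between a `<header>`, a `<nav>` and a `<footer>` returns only the text inside that `<div>`. Matching class and id hints on whole tokens rather than substrings keeps a hint such as `ad` from flagging unrelated names like `read-more`.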
