Journal: Journal of information and computational science

WPBL: A Webpage Block Labeling Based Approach for Web Information Extraction


Abstract

Conventional web mining methods (e.g., information retrieval and information extraction) usually treat the whole webpage as the basic unit of processing. This leads to problems such as low extraction precision and the loss of semantic information. In this paper, we present an algorithm based on Webpage Block Labeling (WPBL). In our system, a webpage is first segmented at a finer granularity into several structurally coherent blocks, which are then automatically labeled. To identify the importance of different blocks, we propose an effective ranking algorithm based on each block's location and its visual features on the webpage, which helps to find important content or links on a webpage. In addition, to demonstrate the prospects of WPBL, we apply it to web information extraction. The experimental results show that the WPBL algorithm can significantly improve the performance of information extraction.
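
The abstract does not state the ranking formula, so the following is only a minimal Python sketch of how a block could be scored from its location and visual features. Everything in it is an assumption made for illustration and is not taken from the paper: the `Block` fields, the three features (closeness to the page center, relative area, font size), the weights, and the `max_font` normalizer.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A segmented webpage block (hypothetical fields, not from the paper)."""
    label: str        # assumed label such as "main", "nav", "footer"
    x: float          # left edge in pixels
    y: float          # top edge in pixels
    width: float
    height: float
    font_size: float  # dominant font size inside the block


def block_score(b, page_w, page_h,
                w_loc=0.5, w_area=0.3, w_font=0.2, max_font=32.0):
    """Weighted sum of one location feature and two visual features.

    The weights and max_font are illustrative assumptions.
    """
    # Location: 1.0 when the block center sits on the page center,
    # decreasing to 0.0 at a page corner.
    cx = (b.x + b.width / 2) / page_w
    cy = (b.y + b.height / 2) / page_h
    dist = ((cx - 0.5) ** 2 + (cy - 0.5) ** 2) ** 0.5
    loc = 1.0 - min(dist / (0.5 ** 0.5), 1.0)  # sqrt(0.5) = center-to-corner

    # Visual features: fraction of the page covered, and normalized font size.
    area = min((b.width * b.height) / (page_w * page_h), 1.0)
    font = min(b.font_size / max_font, 1.0)

    return w_loc * loc + w_area * area + w_font * font


def rank_blocks(blocks, page_w, page_h):
    """Return blocks ordered from most to least important under block_score."""
    return sorted(blocks,
                  key=lambda b: block_score(b, page_w, page_h),
                  reverse=True)
```

Under these assumed weights, a large block near the center of the page outranks thin strips at the top or bottom edge, which matches the intuition that main content is more important than navigation bars or footers. An information extraction step could then process only the top-ranked blocks instead of the whole page.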
