WPBL: A Webpage Block Labeling Based Approach for Web Information Extraction

Naizhou Zhang; Shijun Li; Zhuo Zhang; Wei Cao


Abstract

Conventional web mining (e.g., information retrieval, information extraction) usually treats the whole webpage as the basic unit of processing. This leads to problems such as low extraction precision and the loss of semantic information. In this paper, we present an algorithm based on Webpage Block Labeling (WPBL). In our system, a webpage is first segmented at a finer granularity into several structurally coherent blocks, which are then automatically labeled. To identify the importance of different blocks, we propose a ranking algorithm based on a block's location and its visual features on the webpage, which helps to find important content or links on the page. In addition, to demonstrate the prospects of WPBL, we apply it to web information extraction. The experimental results show that the WPBL algorithm can significantly improve the performance of information extraction.
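
The abstract does not give the ranking formula, so the following is only a minimal sketch of how a block-importance score based on location and visual features might look. The `Block` fields, the `importance` and `rank_blocks` functions, the choice of features (centre position, relative area, font size), and the equal weights are all illustrative assumptions, not the authors' method.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A labeled webpage block with hypothetical layout features (in pixels)."""
    label: str
    x: float          # left edge
    y: float          # top edge
    width: float
    height: float
    font_size: float  # dominant font size inside the block


def importance(block: Block, page_width: float, page_height: float,
               base_font: float = 12.0) -> float:
    """Return a score in [0, 1]; the weights are assumed, not taken from the paper."""
    cx = block.x + block.width / 2
    cy = block.y + block.height / 2
    # Location: blocks near the horizontal centre and near the top score higher.
    horizontal = 1.0 - abs(cx - page_width / 2) / (page_width / 2)
    vertical = 1.0 - cy / page_height
    location = 0.5 * horizontal + 0.5 * vertical
    # Visual features: share of the page area and font size relative to a base size.
    area = (block.width * block.height) / (page_width * page_height)
    font = min(block.font_size / base_font, 2.0) / 2.0
    visual = 0.5 * area + 0.5 * font
    return 0.5 * location + 0.5 * visual


def rank_blocks(blocks, page_width, page_height):
    """Sort blocks from most to least important under the assumed score."""
    return sorted(blocks,
                  key=lambda b: importance(b, page_width, page_height),
                  reverse=True)


if __name__ == "__main__":
    page = [
        Block("sidebar", 0, 150, 180, 600, 11),
        Block("main", 200, 150, 600, 600, 14),
        Block("footer", 0, 900, 1000, 100, 10),
    ]
    for b in rank_blocks(page, 1000, 1000):
        print(b.label, round(importance(b, 1000, 1000), 3))
```

In this sketch a large, centred block with a larger font ranks above a narrow sidebar or a footer. A real system would derive the features from a rendered page and would likely tune or learn the weights.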

Bibliographic record

Source
Journal of information and computational science, 2010, Issue 1, pp. 49-55 (7 pages)
Authors
Naizhou Zhang; Shijun Li; Zhuo Zhang; Wei Cao
Affiliations

School of Computer, Wuhan University, Wuhan 430072, China; Zhixing College, Hubei University, Wuhan 430011, China;

School of Computer, Wuhan University, Wuhan 430072, China;

School of Computer, Wuhan University, Wuhan 430072, China;

Computer College, Wuhan Institute of Technology, Wuhan 430074, China;

Original format: PDF
Language: English
Keywords
webpage segmentation; webpage blocks labeling; Web Information Extraction;

Similar literature

1. An FAR-SW based approach for webpage information extraction [J]. Zhan Bu, Chengcui Zhang, Zhengyou Xia. Information systems frontiers, 2014, Issue 5.
2. A New Recognition Approach for Logical Link Blocks in Webpages [J]. X.M. Wang, Z.D. Wu, Y.N. Huang. Journal of digital information management, 2015, Issue 2.
3. Rider-Rank Algorithm-Based Feature Extraction for Re-ranking the Webpages in the Search Engine [J]. Lata Jaywant Sankpal, Suhas H. Patil. The Computer journal, 2020, Issue 10.
4. Automatic extraction of informative blocks from webpages [C]. Sandip Debnath, Prasenjit Mitra, C. Lee Giles. ACM symposium on Applied computing, 2005.
5. Detecting malicious Webpages using content based classification [D]. Bannur, Sushma Nagesh. 2011.
6. Quality of top webpages providing abortion pill information for Google searches in the USA: An evidence-based webpage quality assessment [O]. Elizabeth Pleasants, Sylvia Guendelman, Karen Weidert. 2021.
7. Improving Webpage Content Extraction by extending a novel single page extraction approach: A case study with Thai websites [O]. Thanadechteemapat W., Fung C.C. 2012.
