
A New Recognition Approach for Logical Link Blocks in Webpages



Abstract

Link blocks are a block structure found widely in webpages. Existing approaches to link block recognition generally suffer from two drawbacks: 1) they target only link blocks with a physical structure, and often only specific link blocks formed by block-level elements; and 2) they discover and recognize link blocks by analyzing HTML tag trees, which often incurs high computational cost and leaves them unable to handle the diverse non-standard webpages on the Internet. To address this, we propose the concept of logical link blocks and present an effective approach to discovering and recognizing them in webpages. The approach recognizes logical link blocks by scanning the HTML code and calculating the distance between adjacent links; two distance thresholds then determine the final logical link blocks. As a result, the approach is not limited to specific block-level link blocks, and it is considerably more robust because analysis of HTML tag trees is no longer required. Experimental results demonstrate the effectiveness of the proposed approach, which not only provides a new way to recognize logical link blocks and extract text, but can also be applied in other web information processing and mining fields because it places fewer demands on controlling the granularity of link blocks.
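The abstract describes the method only in outline, so the following is a minimal Python sketch of the general idea rather than the authors' algorithm. It scans the raw HTML with regular expressions (no tag tree is built), measures the distance between each pair of adjacent links, and groups links whose distance stays under two thresholds. Everything specific here is an assumption made for illustration: the two distance measures (visible characters and number of tags between links), the parameter names `max_text`, `max_tags`, and `min_links`, and their default values. The abstract does not say how the paper defines distance or applies its two thresholds.

```python
import re

# Regex scan of the raw HTML; no tag tree is constructed.
ANCHOR_RE = re.compile(r"<a\b[^>]*>.*?</a\s*>", re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")


def link_gaps(html):
    """Return all anchors and a (text_distance, tag_distance) pair
    for every two adjacent anchors."""
    anchors = list(ANCHOR_RE.finditer(html))
    gaps = []
    for prev, nxt in zip(anchors, anchors[1:]):
        between = html[prev.end():nxt.start()]
        tag_distance = len(TAG_RE.findall(between))
        visible = TAG_RE.sub("", between)
        # Assumed measure: non-whitespace characters between the links.
        text_distance = len("".join(visible.split()))
        gaps.append((text_distance, tag_distance))
    return anchors, gaps


def logical_link_blocks(html, max_text=3, max_tags=4, min_links=2):
    """Group adjacent links into blocks. Hypothetical rule: two links
    share a block when both distances are within their thresholds."""
    anchors, gaps = link_gaps(html)
    if not anchors:
        return []
    blocks, current = [], [anchors[0].group(0)]
    for anchor, (text_d, tag_d) in zip(anchors[1:], gaps):
        if text_d <= max_text and tag_d <= max_tags:
            current.append(anchor.group(0))
        else:
            blocks.append(current)
            current = [anchor.group(0)]
    blocks.append(current)
    # Assumed filter: a lone link is not treated as a block.
    return [b for b in blocks if len(b) >= min_links]


page = (
    '<p><a href="/a">A</a> | <a href="/b">B</a> | <a href="/c">C</a></p>'
    '<p>A long paragraph of ordinary body text separates the two groups.</p>'
    '<ul><li><a href="/x">X</a></li><li><a href="/y">Y</a></li></ul>'
)
blocks = logical_link_blocks(page)
print([len(b) for b in blocks])
```

On this sample page the sketch finds two blocks, of three and two links. The first is an inline run of links inside a paragraph and the second is a list, which illustrates how a distance-based rule could group links without depending on a particular block-level container.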

Bibliographic details

  • Source
    Journal of Digital Information Management | 2015, Issue 2 | pp. 76-85 | 10 pages
  • Author affiliations

    Oujiang College, Wenzhou University, Wenzhou, Network Research Institute of Wenzhou, Wenzhou, Zhejiang, 325035, China;

    Oujiang College, Wenzhou University, Wenzhou;

    kloudSmart, Inc., 1175 Eagle Cliff Court, San Jose, CA 95120, U.S.A.;

    School of Mathematics and Computer Science, Hubei University of Arts and Science, Xiangyang, Hubei 441053, China, Institute of Logic and Intelligence, Southwest University, Chongqing 400715, China;

  • Indexing information
  • Original format: PDF
  • Language: English
  • CLC classification
  • Keywords

    Web; Link block; Logical Link Block; Recognition;

