Conference: International Conference on Signal-Image Technology and Internet-Based Systems
Extracting the Latent Hierarchical Structure of Web Documents

Abstract

The hierarchical structure of a document plays an important role in understanding the relationships between its contents. However, web documents do not always represent this structure explicitly through the available HTML hierarchical tags. Headings, by contrast, are usually differentiated from 'normal' text in terms of presentation, which provides an implicit structure discernible by a human reader. An important pre-processing step for applications that operate at the hierarchical level is therefore to extract this implicitly represented hierarchical structure. This paper presents an algorithm for heading detection and heading-level detection that makes use of various visual presentation features. Results of evaluating the algorithm are also reported.
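
The abstract does not describe the algorithm itself, so the following is only a minimal sketch of the general idea (detecting headings and their levels from visual presentation), not the paper's method. The `Block` type, the `detect_headings` function, the 80-character threshold, and the choice of font size and boldness as features are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """A rendered text block with hypothetical visual features."""
    text: str
    font_size: float  # rendered size in points (assumed to be available)
    bold: bool = False


def detect_headings(blocks):
    """Return (level, block) pairs for blocks that look like headings.

    Heuristic (an assumption, not the paper's algorithm): the font size
    that covers the most characters is treated as body text. A block is a
    heading if it is larger than body text, or if it is bold, body-sized,
    and short. Levels come from ranking the distinct (font_size, bold)
    styles from most to least prominent.
    """
    if not blocks:
        return []

    # Weight each font size by character count so long paragraphs dominate.
    size_weight = Counter()
    for b in blocks:
        size_weight[b.font_size] += len(b.text)
    body_size = size_weight.most_common(1)[0][0]

    def is_heading(b):
        if b.font_size > body_size:
            return True
        return b.bold and b.font_size == body_size and len(b.text) < 80

    candidates = [b for b in blocks if is_heading(b)]

    # Larger sizes rank first; at equal size, bold ranks before non-bold.
    styles = sorted({(b.font_size, b.bold) for b in candidates}, reverse=True)
    level_of = {style: i + 1 for i, style in enumerate(styles)}
    return [(level_of[(b.font_size, b.bold)], b) for b in candidates]


# Hypothetical input: blocks as they might come out of a rendering step.
sample = [
    Block("Web Document Structure", 24),
    Block("Body text. " * 20, 12),
    Block("Background", 18, bold=True),
    Block("Body text. " * 20, 12),
    Block("Note", 12, bold=True),
    Block("Body text. " * 20, 12),
]
result = detect_headings(sample)
for level, block in result:
    print(level, block.text)
```

Ranking styles rather than mapping fixed sizes to levels keeps the sketch independent of any particular stylesheet. A real system would likely need more features (color, spacing, numbering) and a way to handle inconsistent styling.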