Internet Multimedia Services Architecture and Applications (IMSAA), 2009

Web page DOM node characterization and its application to page segmentation



Abstract

Web pages are generally organized in terms of visually distinct segments, such as navigation bars, advertisement banners, headers, portlets and widgets. Despite their apparently structured layout, web pages are considered a source of unstructured data from an information extraction point of view. Hence, as a step towards interpreting the organization of web data, web page segmentation attempts to identify cohesive regions of a page. In this paper, we present a novel DOM tree mining approach for page segmentation. We first characterize the nodes of the DOM tree based on their Content Size and Entropy. While the Content Size of a node indicates the amount of textual content contributed by its subtree, Entropy measures the strength of the local “patterns” exhibited therein. In other words, a node manifesting highly repetitive patterns receives a high Entropy under our formulation. Based on this characterization of DOM nodes, we then develop an unsupervised algorithm to automatically identify the segments of a given web page.
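The abstract does not give the formulas behind the two node features, so the following is only a minimal sketch of the idea under stated assumptions. Content size is taken here to be the number of text characters in a node's subtree. The paper's Entropy is not reproduced; instead a stand-in `repetition_score` is used, defined as one minus the normalized Shannon entropy of the child tag names, so that a node whose children all share one tag scores high. The names `Node`, `TreeBuilder`, `parse`, `content_size` and `repetition_score` are hypothetical and chosen for illustration only.

```python
import math
from collections import Counter
from html.parser import HTMLParser

# Elements that never have children (partial list, enough for a sketch).
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """A simplified DOM node: tag name, child elements, and directly owned text."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Builds a simplified element tree from HTML using the standard library parser."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def content_size(node):
    """Assumed definition: text characters contributed by the node's subtree."""
    return len(node.text) + sum(content_size(child) for child in node.children)


def repetition_score(node):
    """Stand-in for the paper's Entropy (an assumption, not the authors' formula).

    Returns 1.0 when all children share one tag and 0.0 when every child
    has a different tag.
    """
    tags = [child.tag for child in node.children]
    if len(tags) < 2:
        return 0.0
    total = len(tags)
    shannon = -sum((n / total) * math.log2(n / total) for n in Counter(tags).values())
    return 1.0 - shannon / math.log2(total)
```

With such features, a list-like node (for example a `ul` holding only `li` children) would score high on repetition, while a container mixing unrelated children would score low. How the paper combines the two features into a segmentation decision is not stated in the abstract and is not sketched here.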
