Journal: World Wide Web

Identifying semantic blocks in Web pages using Gestalt laws of grouping

Abstract

Semantic block identification is an approach to retrieving information from Web pages and applications. As Website design evolves, however, traditional methods no longer perform well. This paper proposes a new model that merges Web page content into semantic blocks by simulating human perception. A "layer tree" is constructed to remove hierarchical inconsistencies between the DOM tree representation and the visual layout of the Web page. The Gestalt laws of grouping are then interpreted as rules for semantic block detection. To operationalize these laws, the normalized Hausdorff distance, the CIE-Lab color difference, the normalized compression distance, and several kinds of visual information are proposed. Finally, a classifier is trained to combine the operationalized laws into a unified rule for identifying semantic blocks in a Web page. Experiments compare the effectiveness of the model with that of a state-of-the-art algorithm, VIPS. The results of the first experiment show that the GLM model generates more "true positives" and fewer "false negatives" than VIPS. The second experiment, on a large-scale test set, yields an average precision of 90.53% and a recall of 90.85%, approximately 25% better than VIPS.
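
The three distance measures named in the abstract can be illustrated with a minimal Python sketch. The abstract does not say how the paper normalizes the Hausdorff distance, which CIE-Lab difference formula it uses, or which compressor backs the normalized compression distance, so everything below is an assumption: the function names `hausdorff`, `delta_e_cie76`, and `ncd` are hypothetical, the Hausdorff distance is left unnormalized, the color difference is the simple CIE76 Euclidean form, and `zlib` stands in for the compressor.

```python
import math
import zlib


def hausdorff(a, b):
    """Symmetric Hausdorff distance between two non-empty 2-D point sets.

    Unnormalized: the paper's normalization is not given in the abstract.
    """
    def directed(p, q):
        # Largest distance from a point in p to its nearest point in q.
        return max(min(math.dist(u, v) for v in q) for u in p)

    return max(directed(a, b), directed(b, a))


def delta_e_cie76(lab1, lab2):
    """CIE76 color difference: Euclidean distance between two (L, a, b) triples."""
    return math.dist(lab1, lab2)


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance, with zlib as an assumed compressor."""
    cx = len(zlib.compress(x))
    cy = len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    # Corner points of two hypothetical page elements (proximity cue).
    print(hausdorff([(0, 0), (1, 0)], [(0, 0), (4, 0)]))
    # Two hypothetical background colors in CIE-Lab (similarity cue).
    print(delta_e_cie76((50, 0, 0), (50, 3, 4)))
    # Markup of two hypothetical sibling elements (content similarity cue).
    item = b"<li><a href='#'>item</a></li>" * 20
    print(ncd(item, item), ncd(item, bytes(range(256))))
```

In a pipeline like the one the abstract describes, values of this kind would be features fed to the trained classifier; how the paper actually combines them is not stated here.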