Published in: International Journal of Electronic Business

Searching for web information more efficiently using presentational layout analysis

Abstract

Extracting and processing information from web pages is an important task in many areas, such as building search engines, information retrieval, and web data mining. A common approach in the extraction process is to represent a page as a 'bag of words' and then to perform further processing on this flat representation. In this paper we propose a new, hierarchical representation that includes browser screen coordinates for every HTML object on a page. Using this visual information, one can define heuristics for recognising common page areas such as the header, left and right menus, footer and centre of a page. Initial experiments have shown that, using our heuristics, the defined areas are recognised properly in 73% of cases. Finally, we introduce a classification system which, taking into account the proposed document layout analysis, clearly outperforms standard systems by 10% or more.
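
The abstract does not give the heuristics themselves, so the following is only a minimal sketch of what a coordinate-based area rule could look like. The Box type, the classify_area function and the band and side thresholds are all hypothetical, chosen for illustration, and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Rendered bounding box of one HTML object, in pixels (hypothetical type)."""
    x: float       # left edge
    y: float       # top edge
    width: float
    height: float


def classify_area(box, page_width, page_height, band=0.2, side=0.3):
    """Assign a box to a common page area from its screen coordinates.

    The thresholds are illustrative assumptions, not the paper's values:
    `band` is the fraction of page height treated as header or footer,
    `side` is the fraction of page width treated as a side menu.
    """
    top, bottom = box.y, box.y + box.height
    left, right = box.x, box.x + box.width

    if bottom <= band * page_height:
        return "header"        # lies entirely in the top band
    if top >= (1 - band) * page_height:
        return "footer"        # lies entirely in the bottom band
    if right <= side * page_width:
        return "left menu"     # lies entirely in the left strip
    if left >= (1 - side) * page_width:
        return "right menu"    # lies entirely in the right strip
    return "centre"            # everything else


if __name__ == "__main__":
    # A 1000 x 2000 pixel page with a narrow column on the left.
    print(classify_area(Box(0, 500, 200, 800), 1000, 2000))
```

In a real pipeline the boxes would come from a rendering engine rather than being typed in by hand, and the thresholds would need tuning against labelled pages.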
