An FAR-SW based approach for webpage information extraction


Abstract

Automatically identifying and extracting the target information of a webpage, especially its main text, is a critical task in many web content analysis applications, such as information retrieval and automated screen reading. However, compared with typical plain text, information on the web is structured in extremely complex ways and follows no single fixed template or layout. Moreover, the number of presentation elements on web pages, such as dynamic navigational menus, flashing logos, and a multitude of ad blocks, has increased rapidly in the past decade. In this paper, we propose a statistics-based approach that integrates fuzzy association rules (FAR) with a sliding window (SW) to efficiently extract the main text content from web pages. Our approach involves two separate stages. In Stage 1, the original HTML source is preprocessed and features are extracted for every line of text; supervised learning is then performed to detect fuzzy association rules in the training web pages. In Stage 2, the HTML source is preprocessed and text line features are extracted in the same way as in Stage 1, after which the extracted fuzzy association rules are used to test whether each text line belongs to the main text. Next, a sliding window is applied to segment the web page into several potential topical blocks. Finally, a simple selection algorithm selects the important blocks, which are then united as the detected topical region (the main text). Experimental results on real-world data show that the efficiency and accuracy of our approach are better than those of existing Document Object Model (DOM)-based and vision-based approaches.
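
The abstract does not specify the line features, the learned rules, the window size, or the selection criterion, so the following is only a minimal sketch of the sliding-window and block-selection steps under stated assumptions. It assumes each text line already carries a main-text score in [0, 1] (a stand-in for the output of the fuzzy association rules), and the function names, the window width, the threshold, and the `keep_ratio` rule are all hypothetical choices made for illustration, not the authors' method.

```python
from typing import List, Tuple


def window_means(scores: List[float], width: int) -> List[float]:
    """Mean score of the window centred on each line (truncated at the edges)."""
    half = width // 2
    means = []
    for i in range(len(scores)):
        lo, hi = max(0, i - half), min(len(scores), i + half + 1)
        means.append(sum(scores[lo:hi]) / (hi - lo))
    return means


def segment_blocks(scores: List[float], width: int = 3,
                   threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Group consecutive lines whose windowed mean is >= threshold.

    Returns (start, end) line ranges with end exclusive. Both width and
    threshold are assumed values, not taken from the paper.
    """
    blocks: List[Tuple[int, int]] = []
    start = None
    for i, mean in enumerate(window_means(scores, width)):
        if mean >= threshold and start is None:
            start = i                      # a candidate block opens here
        elif mean < threshold and start is not None:
            blocks.append((start, i))      # the block closes before line i
            start = None
    if start is not None:
        blocks.append((start, len(scores)))
    return blocks


def select_blocks(blocks: List[Tuple[int, int]], scores: List[float],
                  keep_ratio: float = 0.5) -> List[Tuple[int, int]]:
    """Keep blocks whose summed score is at least keep_ratio times the best block's.

    This is a hypothetical stand-in for the paper's "simple selection algorithm".
    """
    if not blocks:
        return []
    totals = [sum(scores[s:e]) for s, e in blocks]
    best = max(totals)
    return [b for b, t in zip(blocks, totals) if t >= keep_ratio * best]


# Hypothetical per-line scores for a 12-line page.
line_scores = [0.0, 0.1, 0.9, 0.8, 0.9, 0.1, 0.0, 0.0, 0.7, 0.9, 0.0, 0.0]
candidate_blocks = segment_blocks(line_scores)
topical_region = select_blocks(candidate_blocks, line_scores, keep_ratio=0.7)
print(candidate_blocks)  # [(2, 5), (8, 10)]
print(topical_region)    # [(2, 5)]
```

Averaging over a window smooths isolated high- or low-scoring lines, so a single short line inside an article body does not split the block, and a lone high-scoring line in a navigation area does not form one by itself.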

Bibliographic details

  • Source
    Information Systems Frontiers | 2014, Issue 5 | pp. 771-785 | 15 pages
  • Author affiliations

    College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, China, and Computer and Information Sciences, The University of Alabama at Birmingham, Birmingham, AL, USA;

    Computer and Information Sciences, The University of Alabama at Birmingham, Birmingham, AL, USA;

    College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, China;

    College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, China;

  • Original format: PDF
  • Language: English
  • Keywords

    Information extraction; Statistics-based; Fuzzy association rule; Sliding window; Topical region;
