Journal of Advanced Computational Intelligence and Intelligent Informatics

A Signal-Representation-Based Parser to Extract Text-Based Information from the Web


Abstract

Most of the dramatically increased amount of information available on the World Wide Web is provided via HTML and formatted for human browsing rather than for software programs. This situation calls for a tool that automatically extracts information from semistructured Web information sources, increasing the usefulness of value-added Web services. We present a signal-representation-based parser (SIRAP) that breaks Web pages up into logically coherent groups, for example groups of information related to an entity. Templates for records with different tag structures are generated incrementally by a Histogram-Based Correlation Coefficient (HBCC) algorithm, then records on a Web page are detected efficiently using templates generated by matching. Hundreds of Web pages from 17 state-of-the-art search engines were used to demonstrate the feasibility of our approach.
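The abstract does not define how the HBCC score is computed, so the following is only an illustrative sketch of the general idea of comparing record tag structures through histograms. It assumes a histogram of opening-tag counts per HTML fragment and an ordinary Pearson correlation between two such histograms; the names `tag_histogram` and `histogram_correlation` are hypothetical and are not taken from the paper.

```python
from collections import Counter
from html.parser import HTMLParser
from math import sqrt


class TagCollector(HTMLParser):
    """Collects the names of opening tags in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_histogram(html):
    """Count how often each opening tag occurs in an HTML fragment."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return Counter(collector.tags)


def histogram_correlation(h1, h2):
    """Pearson correlation of two tag histograms over their combined tag set.

    Assumption: this stands in for the paper's HBCC score, whose exact
    definition is not given in the abstract.
    """
    vocab = sorted(set(h1) | set(h2))
    if len(vocab) < 2:
        return 0.0
    x = [h1.get(t, 0) for t in vocab]
    y = [h2.get(t, 0) for t in vocab]
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat histogram has no defined correlation
    return cov / sqrt(var_x * var_y)


# Two search-result-like records with the same tag structure, and one unrelated block.
record_a = '<div><a href="#">Title A</a><p>Snippet</p><span>url</span><span>date</span></div>'
record_b = '<div><a href="#">Title B</a><p>Other</p><span>url</span><span>date</span></div>'
other = '<table><tr><td>x</td><td>y</td></tr></table>'

same_structure = histogram_correlation(tag_histogram(record_a), tag_histogram(record_b))
different_structure = histogram_correlation(tag_histogram(record_a), tag_histogram(other))
```

In a sketch like this, fragments that share a tag structure score close to 1, while structurally unrelated fragments score low or negative, so a threshold on the score could decide whether a fragment matches an existing template or should start a new one. Whether SIRAP uses such a threshold is not stated in the abstract.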
