A Signal-Representation-Based Parser to Extract Text-Based Information from the Web

Mu-Chun Su; Shao-Jui Wang; Chen-Ko Huang; Pa-Chun Wang; Fu-Hau Hsu; Shih-Chieh Lin; Yi-Zeng Hsieh

Abstract

Most of the rapidly growing body of information on the World Wide Web is delivered as HTML and formatted for human browsing rather than for software programs. This calls for a tool that automatically extracts information from semistructured Web sources, making value-added Web services more useful. We present a signal-representation-based parser (SIRAP) that breaks Web pages up into logically coherent groups, for example groups of information related to a single entity. Templates for records with different tag structures are generated incrementally by a Histogram-Based Correlation Coefficient (HBCC) algorithm, and records on a Web page are then detected efficiently by matching against the generated templates. Hundreds of Web pages from 17 state-of-the-art search engines were used to demonstrate the feasibility of our approach.
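
The abstract does not say how HBCC is computed, so the following is only an illustrative sketch of the general idea of matching records by tag structure. It assumes, purely for illustration, that each candidate record is summarized as a histogram of its opening HTML tags and that similarity is the Pearson correlation coefficient between two such histograms. The function names, the regular expression, and the 0.9 threshold are hypothetical and are not taken from the paper.

```python
import math
import re
from collections import Counter

# Hypothetical helper: matches opening tags only ("</a>" is skipped
# because "/" is not a letter).
TAG_RE = re.compile(r"<\s*([a-zA-Z][a-zA-Z0-9]*)")


def tag_histogram(html):
    """Count opening-tag names in an HTML fragment (assumed representation)."""
    return Counter(name.lower() for name in TAG_RE.findall(html))


def histogram_correlation(h1, h2):
    """Pearson correlation of two tag histograms over the union of their tags."""
    tags = sorted(set(h1) | set(h2))
    if len(tags) < 2:
        return 0.0
    x = [h1.get(t, 0) for t in tags]
    y = [h2.get(t, 0) for t in tags]
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # A flat histogram has no defined correlation; treat it as no match.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def match_records(template_html, candidates, threshold=0.9):
    """Return candidates whose tag histogram correlates with the template's."""
    template = tag_histogram(template_html)
    return [
        c for c in candidates
        if histogram_correlation(template, tag_histogram(c)) >= threshold
    ]


template = '<div><a href="x">Title</a><p>snippet</p><p>url</p></div>'
same_shape = '<div><a href="y">Other</a><p>text</p><p>link</p></div>'
different = "<table><tr><td>1</td><td>2</td></tr></table>"
matches = match_records(template, [same_shape, different])
```

In this sketch a fragment with the same tag structure as the template scores close to 1 and is kept, while a fragment built from different tags scores low and is rejected. A real system would also need to decide how candidate fragments are segmented and how new templates are added, which the abstract leaves unspecified.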

Bibliographic details

Source: Journal of Advanced Computational Intelligence and Intelligent Informatics | 2010, Issue 77 | 9 pages
Authors: Mu-Chun Su; Shao-Jui Wang; Chen-Ko Huang; Pa-Chun Wang; Fu-Hau Hsu; Shih-Chieh Lin; Yi-Zeng Hsieh
Original format: PDF
Language: English
Chinese Library Classification: Other computers
Keywords: Information extraction; Wrapper; Parser; Web; Template matching
Date added to database: 2022-08-18 23:09:57

