Exploiting Content Redundancy for Web Information Extraction

Abstract

We propose a novel extraction approach that exploits content redundancy on the web to extract structured data from template-based web sites. We start by populating a seed database with records extracted from a few initial sites. We then identify values within the pages of each new site that match attribute values contained in the seed set of records. To filter out noisy attribute value matches, we exploit the fact that attribute values occur at fixed positions within template-based sites. We develop an efficient Apriori-style algorithm to systematically enumerate attribute position configurations with sufficient matching values across pages. Finally, we conduct an extensive experimental study with real-life web data to demonstrate the effectiveness of our extraction approach.
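The abstract does not spell out the enumeration step, so the following is only a minimal sketch of how an Apriori-style search over attribute position configurations might look, not the authors' algorithm. It assumes each page is a dictionary from a position (for example an XPath-like string) to the text found there, that a match is an exact string comparison against a seed record, and that the support of a configuration is the number of pages on which a single seed record matches at every position in it. The function names, the page representation, and the `min_support` parameter are all hypothetical.

```python
from itertools import combinations


def page_matches(page, seed_records):
    """Return, per seed record, the set of (attribute, position) pairs it matches on this page."""
    result = []
    for record in seed_records:
        pairs = frozenset(
            (attr, pos)
            for pos, text in page.items()
            for attr, value in record.items()
            if text == value  # assumption: exact string match
        )
        if pairs:
            result.append(pairs)
    return result


def support(config, matches_per_page):
    """Count pages where one seed record matches every (attribute, position) pair in config."""
    return sum(
        any(config <= pairs for pairs in page_sets)
        for page_sets in matches_per_page
    )


def frequent_configurations(pages, seed_records, min_support):
    """Level-wise (Apriori-style) enumeration of configurations with enough support."""
    matches_per_page = [page_matches(p, seed_records) for p in pages]
    singletons = {
        frozenset([pair])
        for page_sets in matches_per_page
        for pairs in page_sets
        for pair in pairs
    }
    level = {c for c in singletons if support(c, matches_per_page) >= min_support}
    frequent = {}
    while level:
        for config in level:
            frequent[config] = support(config, matches_per_page)
        candidates = set()
        for a, b in combinations(level, 2):
            union = a | b
            if len(union) != len(a) + 1:
                continue  # only join sets that differ in exactly one pair
            attrs = {attr for attr, _ in union}
            positions = {pos for _, pos in union}
            if len(attrs) < len(union) or len(positions) < len(union):
                continue  # one position per attribute, one attribute per position
            # Apriori pruning: every subset one pair smaller must itself be frequent.
            if all(union - {pair} in level for pair in union):
                candidates.add(union)
        level = {c for c in candidates if support(c, matches_per_page) >= min_support}
    return frequent
```

Because support can only shrink as pairs are added to a configuration, a spurious match that appears on a single page is discarded at the first level and never takes part in larger candidates, which is the filtering effect the abstract attributes to fixed attribute positions.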

Bibliographic details

Source: 19th International World Wide Web Conference 2010 | 2010 | pp. 1105-1106 | 2 pages
Authors: Pankaj Gulhane; Rajeev Rastogi; Srinivasan H Sengamedu; Ashwin Tengli
Original format: PDF
Classification (Chinese Library Classification): Computer networks
Keywords: content redundancy; information extraction
