A Framework for Automated Scraping of Structured Data Records From the Deep Web Using Semantic Labeling: Semantic Scraper

Umamageswari Kumaresan; Kalpana Ramanujam


Abstract

This research aims to develop an automated web scraping system capable of extracting structured data records embedded in semi-structured web pages. Most automated extraction techniques in the literature capture repeated patterns among a set of similarly structured web pages, deduce the template used to generate those pages, and then extract the data records. All of these techniques rely on computationally intensive operations such as string pattern matching or DOM tree matching, and they then require manual labeling of the extracted data records. The technique discussed in this paper departs from the state-of-the-art approaches by identifying the informative sections of a web page through repetition of informative content rather than syntactic structure. In the experiments, the system identified the data-rich region with 100% precision for websites belonging to different domains. Experiments conducted on real-world websites demonstrate the effectiveness and versatility of the proposed approach.
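
The abstract does not spell out the algorithm, so the following is only a toy sketch of the general idea of locating a data-rich region by repeated content rather than by repeated markup. It is not the authors' method. The regex-based `label` function, the `find_data_region` helper, and the sample page are all assumptions made for illustration: each child of a candidate container is reduced to a tuple of coarse content labels, tag names are ignored, and the container whose children repeat the same label tuple most often is picked.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class Node:
    """Minimal DOM node: tag name, attributes, children, and direct text."""
    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.texts = []  # non-empty text fragments directly inside this node


class TreeBuilder(HTMLParser):
    """Builds a small tree with the standard-library HTML parser."""
    VOID = {"br", "img", "hr", "input", "meta", "link"}

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in self.VOID:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.current.texts.append(text)


def label(text):
    """Hypothetical content labeler: maps a text fragment to a coarse type."""
    if re.fullmatch(r"[$€£]\s?\d[\d,]*(\.\d+)?", text):
        return "PRICE"
    if re.fullmatch(r"\d{4}-\d{2}-\d{2}", text):
        return "DATE"
    if re.fullmatch(r"\d+(\.\d+)?", text):
        return "NUMBER"
    return "TEXT"


def signature(node):
    """Content labels found in a subtree; tag names play no part."""
    labels = [label(t) for t in node.texts]
    for child in node.children:
        labels.extend(signature(child))
    return tuple(labels)


def walk(node):
    yield node
    for child in node.children:
        yield from walk(child)


def find_data_region(html):
    """Return (node, score) for the container whose children most often
    share the same content signature. Signatures with fewer than two
    fields are skipped so that plain link lists do not count as records."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    best, best_score = None, 0
    for node in walk(builder.root):
        counts = Counter(
            sig for sig in (signature(c) for c in node.children) if len(sig) >= 2
        )
        if counts:
            score = counts.most_common(1)[0][1]
            if score > best_score:
                best, best_score = node, score
    return best, best_score


# Made-up sample page: the three records use different tags on purpose.
SAMPLE = """
<html><body>
  <ul id="nav"><li>Home</li><li>About</li><li>Contact</li></ul>
  <div id="results">
    <div><h3>Blue Kettle</h3><span>$24.99</span></div>
    <p><b>Steel Pan</b><i>$31.50</i></p>
    <div><h3>Tea Set</h3><span>$18.00</span></div>
  </div>
</body></html>
"""

if __name__ == "__main__":
    region, score = find_data_region(SAMPLE)
    print(region.attrs.get("id"), score)
```

On the sample page the sketch selects the `results` container, because its three children all reduce to the same (`TEXT`, `PRICE`) signature even though their markup differs, while the navigation list is ignored. A real system would need a far richer labeler than these few regular expressions.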

Bibliographic details

Source
International journal of information retrieval research | 2022, Issue 1 | pp. 266-283 | 18 pages
Authors
Umamageswari Kumaresan; Kalpana Ramanujam;
Author affiliation

Pondicherry Engineering College, India;

Original format: PDF
Language of text: English
Keywords
Deep Web; DOM Tree; HTML; Server-Side Templates; Structured Data; Supervised Extraction; Surface Web; Unsupervised Extraction; Web Scraping; XPATH;
Date added to database: 2024-01-25 00:46:55
