Crawling Result Pages for Data Extraction Based on URL Classification

Abstract

In Web database integration, crawling data pages is important for data extraction. Because the data are spread across multiple result pages, accessing them for integration is harder. It is therefore necessary to crawl query result pages from Web databases accurately and automatically. To address this problem, we propose a novel approach based on URL classification that identifies result pages effectively. In our approach, we compute the similarity between the URLs of hyperlinks in result pages and classify them into four categories. Each category maps to a set of similar web pages, which separates result pages from other pages. We then use a page probing method to verify the classification and improve the accuracy of the crawled result pages. The experimental results demonstrate that our approach is effective at identifying the collection of result pages in a Web database and can improve the quality and efficiency of data extraction.
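The URL-similarity step described in the abstract can be illustrated with a minimal sketch. The abstract does not state the similarity measure, the threshold, or what the four categories are, so everything below is an assumption made for illustration: `url_similarity` compares host, path segments, and query-parameter names, and `group_urls` greedily groups links whose similarity reaches a hypothetical `threshold`. It is not the authors' method, and it does not cover the page probing step.

```python
from urllib.parse import urlsplit, parse_qsl


def url_features(url):
    """Split a URL into host, path segments, and the set of query-parameter names."""
    parts = urlsplit(url)
    path_tokens = [seg for seg in parts.path.split("/") if seg]
    query_keys = {key for key, _ in parse_qsl(parts.query, keep_blank_values=True)}
    return parts.netloc.lower(), path_tokens, query_keys


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def url_similarity(u1, u2):
    """Assumed measure: equal weight on path overlap and query-key overlap."""
    host1, path1, keys1 = url_features(u1)
    host2, path2, keys2 = url_features(u2)
    if host1 != host2:
        return 0.0
    path_score = 1.0 if path1 == path2 else jaccard(set(path1), set(path2))
    return 0.5 * path_score + 0.5 * jaccard(keys1, keys2)


def group_urls(urls, threshold=0.8):
    """Greedy grouping: a URL joins the first group whose first member is similar enough."""
    groups = []
    for url in urls:
        for group in groups:
            if url_similarity(url, group[0]) >= threshold:
                group.append(url)
                break
        else:
            # No existing group matched, so this URL starts a new one.
            groups.append([url])
    return groups
```

Under these assumptions, pagination links that differ only in the value of a parameter such as `page` fall into one group, while detail-page and navigation links land in separate groups. A crawler could then follow only the group that matches the pattern of the current result page.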

Bibliographic record

Source: 7th Web Information Systems and Applications Conference, 2010, pp. 79-84 (6 pages)
Authors: Nie Tiezheng; Wang Zhenhua; Kou Yue; Zhang Rui
Original format: PDF
Chinese Library Classification: Computer networks
Keywords: URL; classification; component; data extraction; result pages

Similar documents

1. Deep Web Data Extraction Based on URL and Domain Classification [J]. B. Aysha Banu, M. Chitra. ISACA Journal. 2015
2. Classification of Page to the aspect of Crawl Web Forum and URL Navigation [J]. Yerragunta Kartheek, T. Sunitha Rani. International Journal of Computer Trends and Technology. 2015, No. 1
3. Contourlet-Based Feature Extraction for Computer-Aided Classification of Alzheimer’s Disease [J]. Debesh Jha, Goo-Rak Kwon. Alzheimer’s & Dementia: The Journal of the Alzheimer’s Association. 2018, No. 7
4. Crawling Result Pages for Data Extraction Based on URL Classification [C]. Nie Tiezheng, Wang Zhenhua, Kou Yue. Web Information Systems and Applications Conference. 2010
5. New covariance-based feature extraction methods for classification and prediction of high-dimensional data [D]. Sofolahan, Mopelola A. 2013
6. An automated data extraction and classification pipeline to identify a novel type of neuron within the dorsal striatum based on single-cell patch clamp and confocal imaging data [O]. Miaomiao Mao, Aditya Nair, George J. Augustine. 2020
7. Unvisited URL Relevancy Calculation in Focused Crawling Based on Naïve Bayesian Classification [O]. Debashis Hati, Amritesh Kumar, Lizashree Mishra. 2011
8. Feature Extraction and Classification Results for the Batchelor-Hand-Brumfitt Artificial Data Base [R]. Starks, S. A., Pau, K. C., de Figueiredo, R. J. P. 1976
