A Layout Based Detachment Approach for Extracting Content from Webpages

Deepa Chandran; Anna Saro Vijendran

Abstract

An enormous amount of useful information on the Internet is formatted for human web users, which makes extracting the relevant data from diverse web sources a complex task. Various approaches for extracting data from webpages have recently been proposed. This study presents a simple but effective approach named the Layout Based Detachment Approach (LBDA). The approach extracts the main content of a webpage by removing irrelevant information such as header and footer content, navigation bars, advertisements and other noisy images. The methodology uses three techniques: tag tree parsing to obtain the analysis structure, a block-acquiring page segmentation method to remove unwanted tags, and data extraction to retrieve the necessary content. The approach eliminates noise, extracts the main content blocks from the webpage and displays the essential content to users. Its performance is evaluated using accuracy, precision, recall, execution time and memory usage. The implementation results show that the proposed LBDA performs better than the existing heuristic approach.
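
The general idea of layout-based noise removal described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' LBDA implementation, which the abstract does not specify; it is a simplified tag filter built on the standard library's html.parser. The names NOISE_TAGS, MainContentExtractor and extract_main_content are hypothetical, and the assumption that header, footer, nav and aside elements carry the noise is ours.

```python
from html.parser import HTMLParser

# Assumption: these tags usually hold layout noise rather than main content.
NOISE_TAGS = {"header", "footer", "nav", "aside", "script", "style", "form"}


class MainContentExtractor(HTMLParser):
    """Walks the tag stream and keeps text found outside noise blocks."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.noise_depth = 0  # greater than 0 while inside a noise block
        self.chunks = []      # text fragments kept as main content

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.noise_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.noise_depth > 0:
            self.noise_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.noise_depth == 0:
            self.chunks.append(text)


def extract_main_content(html):
    """Return the text of the page with noise blocks removed."""
    parser = MainContentExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


if __name__ == "__main__":
    sample = (
        "<html><body><header>Site name</header>"
        "<nav><a href='/'>Home</a></nav>"
        "<div><h1>Title</h1><p>Main text.</p></div>"
        "<aside>Advertisement</aside>"
        "<footer>Copyright</footer></body></html>"
    )
    print(extract_main_content(sample))
```

A fixed tag list is the simplest possible segmentation rule. Pages that mark up navigation or advertisements with generic div elements would need additional cues, such as class names or block position, which this sketch does not attempt.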

Bibliographic details

Source
American Journal of Applied Sciences | 2015, Issue 6 | pp. 411-420 | 10 pages
Authors
Deepa Chandran; Anna Saro Vijendran
Author affiliations

Department of Information Technology, SNR Sons College, Coimbatore, India;

MCA, SNR Sons College, Coimbatore, India;

Original format: PDF
Language: English
Keywords
Webpage Content Extraction; Web Mining; DOM Tree Analysis; Web Structure Mining;

Similar literature

1. A Layout Based Detachment Approach for Extracting Content from Webpages [J]. Anna Saro Vijendran, Deepa Chandran. American Journal of Applied Sciences. 2015, Issue 6
2. A keyword-based combination approach for detecting phishing webpages [J]. Ding Yan, Luktarhan Nurbol, Li Keqin. Computers & Security. 2019, Issue JULa
3. Web Content Extraction based on Webpage Layout Analysis [C]. Lei FU, Yao MENG, Yingju XIA. International Conference on Information Technology and Computer Science. 2010
4. Detecting malicious Webpages using content based classification [D]. Bannur, Sushma Nagesh. 2011
5. Content-Based Image Retrieval Using Spatial Layout Information in Brain Tumor T1-Weighted Contrast-Enhanced MR Images [O]. Meiyan Huang, Wei Yang, Yao Wu
6. Learning to Extract Content from News Webpages [O]. Alex Spengler, Patrick Gallinari. 2015
