DOM-based Content Extraction of HTML Documents

Abstract

Web pages often contain clutter (such as pop-up ads, unnecessary images and extraneous links) around the body of an article that distract a user from actual content. Extraction of 'useful and relevant' content from web pages has many applications, including cell phone and PDA browsing, speech rendering for the visually impaired, and text summarization. Most approaches to removing clutter or making content more readable involve changing font size or removing HTML and data components such as images, which takes away from a webpage's inherent look and feel. Unlike 'Content Reformatting', which aims to reproduce the entire webpage in a more convenient form, our solution directly addresses 'Content Extraction'. We have developed a framework that employs an easily extensible set of techniques that incorporate advantages of previous work on content extraction. Our key insight is to work with the Document Object Model tree, rather than with raw HTML markup. We have implemented our approach in a publicly available Web proxy to extract content from HTML web pages.
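The idea of filtering a parsed tree rather than raw markup can be pictured with the minimal Python sketch below. It is not the authors' proxy implementation: the `Node`, `TreeBuilder`, `prune`, and `extract_content` names, the set of dropped tags, and the 0.5 link-text threshold are all assumptions chosen for illustration, and the tree is a simplified stand-in for a full Document Object Model.

```python
from html.parser import HTMLParser

VOID_TAGS = {"img", "br", "hr", "input", "meta", "link"}
DROP_TAGS = {"script", "style", "img", "iframe"}   # assumed clutter tags
BLOCK_TAGS = {"ul", "ol", "table", "div"}          # assumed link-list containers


class Node:
    """A tiny DOM-like node: an element with children, or a text leaf."""

    def __init__(self, tag=None, text=""):
        self.tag = tag          # None marks a text leaf
        self.text = text
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a simplified element tree from HTML markup."""

    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open element, if there is one.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if data.strip():
            self.stack[-1].children.append(Node(text=data.strip()))


def text_of(node):
    """Concatenate all text beneath a node."""
    if node.tag is None:
        return node.text
    parts = (text_of(child) for child in node.children)
    return " ".join(p for p in parts if p)


def link_text_len(node):
    """Count characters of text that sit inside <a> elements."""
    if node.tag == "a":
        return len(text_of(node))
    return sum(link_text_len(child) for child in node.children)


def prune(node, max_link_ratio=0.5):
    """Drop clutter subtrees in place (hypothetical filter rules)."""
    kept = []
    for child in node.children:
        if child.tag in DROP_TAGS:
            continue
        if child.tag in BLOCK_TAGS:
            total = len(text_of(child))
            if total and link_text_len(child) / total > max_link_ratio:
                continue  # mostly link text: treat as a navigation list
        prune(child, max_link_ratio)
        kept.append(child)
    node.children = kept


def extract_content(html):
    """Parse markup into a tree, prune it, and return the remaining text."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    prune(builder.root)
    return text_of(builder.root)


if __name__ == "__main__":
    page = (
        '<body><ul><li><a href="/a">Home</a></li>'
        '<li><a href="/b">News</a></li></ul>'
        "<div><p>The article body has real text.</p>"
        '<img src="ad.gif"><script>popup()</script></div></body>'
    )
    print(extract_content(page))
```

Because each rule operates on subtrees, a whole navigation list or script block is removed as one unit, and further filters could be added as extra checks inside `prune` without touching the parsing step.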

Bibliographic details

Authors
Gupta, S.; Kaiser, G.; Neistadt, D.; Grimm, P.
Year: 2005
Pages: 1-11
Total pages: 11
Original format: PDF
Language: English
Chinese Library Classification: Industrial technology
Keywords
Clutter; Extraction; Internet; Algorithms; Hypertext; Natural language; Identification; Information retrieval

Similar literature

1. Relevance-based content extraction of HTML documents [J]. WU Qi, CHEN Xing-shu, ZHU Kai. Journal of Central South University (English Edition), 2012, No. 7
2. Automating Content Extraction of HTML Documents [J]. Suhit Gupta, Gail E. Kaiser, Peter Grimm. World Wide Web, 2005, No. 2
3. Enhancing the Browser-Side Context-Aware Sanitization of Suspicious HTML5 Code for Halting the DOM-Based XSS Vulnerabilities in Cloud [J]. B.B. Gupta, Shashank Gupta, Pooja Chaudhary. International Journal of Cloud Applications and Computing, 2017, No. 1
4. DOM-Based XHTML Document Structure Analysis Separating Content from Navigation Elements [C]. Mantratzis, C., Cassidy, . 2005
5. Context-based content extraction of HTML documents [D]. Gupta, Suhit. 2006
6. Large-Scale Data Mining of Rapid Residue Detection Assay Data From HTML and PDF Documents: Improving Data Access and Visualization for Veterinarians [O]. Majid Jaberi-Douraki, Soudabeh Taghian Dinani, Nuwan Indika Millagaha Gedara. 2021
7. DOM-based Content Extraction of HTML Documents [O]. Suhit Gupta, Gail Kaiser, David Neistadt. 2003
