Context-based content extraction of HTML documents.

Abstract

Web pages often contain "clutter" (defined by us as unnecessary images, navigational menus and extraneous links) around the body of an article that may distract a user from actual content. Extraction of "useful and relevant" content from web pages has many applications, including speech rendering for the visually disabled, cell phone and PDA browsing, and text summarization. Most existing approaches to making content more directly accessible involve changing font size or removing HTML and data components such as images, which may take away from a webpage's inherent look and feel. Unlike "Content Reformatting", which aims to reproduce the entire webpage in a more convenient form, our solution directly addresses "Content Extraction" and "Clutter Removal".

We introduce Crunch, a framework that employs an easily extensible set of techniques for enabling and integrating heuristics concerned with "content extraction" from HTML web pages. Crunch is implemented as a transparent web proxy and is practically usable by end-users. We use DOM tree based content extraction rather than directly processing HTML as flat files. Crunch is a versatile solution, allowing programmers and administrators to add heuristics to the framework. These heuristics act as filters that can be parameterized and toggled to perform the content extraction. Crunch reduces human involvement in the application of thresholds for the heuristics by automatically detecting and utilizing the content genre of a given website. Genre detection is accomplished via the use of frequency distributions of words associated with the website and associated search engine snippets. These distributions are used to improve the extraction process by comparing them to previously known results that work well for certain genres of sites and utilizing those settings.

We have measured the usability and performance of the content extraction proxy in terms of the quality of the output generated by the heuristics that act as filters after the proxy has inferred the context of a webpage. Ultimately, we show that rather than going with current approaches that are pre-packaged "one size fits all" and programmer controlled, going with a more flexible approach will produce a more content-full result.
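
The idea described in the abstract (detect a page's genre from word frequencies, then use that genre's settings to drive a DOM-based filter) can be illustrated with a minimal Python sketch. This is not Crunch's actual code. The genre profiles, the settings values, the tag list, and all function names are hypothetical. Cosine similarity is an assumed stand-in, since the abstract does not say how the distributions are compared, and the link-to-text ratio filter is only one plausible example of a parameterized, toggleable heuristic. The sketch uses `xml.dom.minidom`, so it assumes well-formed XHTML input.

```python
import math
import re
from collections import Counter
from xml.dom import minidom

# Hypothetical genre profiles (word counts) and per-genre filter settings.
GENRE_PROFILES = {
    "news": Counter({"report": 3, "said": 3, "government": 2, "election": 2}),
    "shopping": Counter({"price": 3, "cart": 3, "shipping": 2, "sale": 2}),
}
GENRE_SETTINGS = {
    "news": {"remove_link_lists": True, "max_link_ratio": 0.5},
    "shopping": {"remove_link_lists": False, "max_link_ratio": 0.9},
}
BLOCK_TAGS = ("div", "ul", "ol", "table")  # assumed candidates for removal


def collect_text(node):
    """Concatenate all text found below a DOM node."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return " ".join(collect_text(child) for child in node.childNodes)


def cosine_similarity(a, b):
    """Cosine similarity between two word-frequency Counters."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def detect_genre(text):
    """Return the genre whose profile is closest to the text's word counts."""
    freqs = Counter(re.findall(r"[a-z]+", text.lower()))
    return max(GENRE_PROFILES, key=lambda g: cosine_similarity(freqs, GENRE_PROFILES[g]))


def text_length(node):
    """Number of non-whitespace-trimmed text characters below a node."""
    if node.nodeType == node.TEXT_NODE:
        return len(node.data.strip())
    return sum(text_length(child) for child in node.childNodes)


def link_text_length(node):
    """Number of text characters that sit inside <a> elements below a node."""
    if node.nodeType == node.ELEMENT_NODE and node.tagName == "a":
        return text_length(node)
    return sum(link_text_length(child) for child in node.childNodes)


def remove_link_lists(node, max_link_ratio):
    """Drop block elements whose text is mostly link text; recurse otherwise."""
    for child in list(node.childNodes):
        if child.nodeType != child.ELEMENT_NODE:
            continue
        total = text_length(child)
        is_block = child.tagName in BLOCK_TAGS
        if is_block and total and link_text_length(child) / total > max_link_ratio:
            node.removeChild(child)
        else:
            remove_link_lists(child, max_link_ratio)


def extract(xhtml):
    """Detect the genre, apply that genre's filter settings, return both."""
    root = minidom.parseString(xhtml).documentElement
    genre = detect_genre(collect_text(root))
    settings = GENRE_SETTINGS[genre]
    if settings["remove_link_lists"]:  # the heuristic can be toggled per genre
        remove_link_lists(root, settings["max_link_ratio"])
    return genre, root.toxml()


PAGE = """<html><body>
<ul><li><a href="/">Home</a></li><li><a href="/world">World</a></li></ul>
<div><p>The government said the election report is due. <a href="/more">More</a></p></div>
</body></html>"""

if __name__ == "__main__":
    genre, cleaned = extract(PAGE)
    print(genre)
    print(cleaned)
```

In this sketch the navigation list is dropped because all of its text is link text, while the article paragraph is kept because only a small share of its text is linked. Keeping the thresholds in a per-genre table mirrors the abstract's point that settings known to work for one genre of site can be reused without manual tuning.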

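Bibliographic details

Author: Gupta, Suhit
Author affiliation: Columbia University
Degree-granting institution: Columbia University
Discipline: Computer Science
Degree: Ph.D.
Year: 2006
Pages: 190 p.
Total pages: 190
Original format: PDF
Language of text: eng
CLC classification: Automation technology, computer technology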
