Context-based content extraction of HTML documents.

Abstract

Web pages often contain "clutter" (defined by us as unnecessary images, navigational menus and extraneous links) around the body of an article that may distract a user from actual content. Extraction of "useful and relevant" content from web pages has many applications, including speech rendering for the visually disabled, cell phone and PDA browsing, and text summarization. Most existing approaches to making content more directly accessible involve changing font size or removing HTML and data components such as images, which may take away from a webpage's inherent look and feel. Unlike "Content Reformatting", which aims to reproduce the entire webpage in a more convenient form, our solution directly addresses "Content Extraction" and "Clutter Removal".

We introduce Crunch, a framework that employs an easily extensible set of techniques for enabling and integrating heuristics concerned with "content extraction" from HTML web pages. Crunch is implemented as a transparent web proxy and is practically usable by end-users. We use DOM tree based content extraction rather than directly processing HTML as flat files. Crunch is a versatile solution, allowing programmers and administrators to add heuristics to the framework. These heuristics act as filters that can be parameterized and toggled to perform the content extraction. Crunch reduces human involvement in the application of thresholds for the heuristics by automatically detecting and utilizing the content genre of a given website. Genre detection is accomplished via the use of frequency distributions of words associated with the website and associated search engine snippets. These distributions are used to improve the extraction process by comparing them to previously known results that work well for certain genres of sites and utilizing those settings.

We have measured the usability and performance of the content extraction proxy in terms of the quality of the output generated by the heuristics that act as filters after the proxy has inferred the context of a webpage. Ultimately, we show that rather than going with current approaches that are pre-packaged "one size fits all" and programmer controlled, going with a more flexible approach will produce a more content-full result.
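
The idea of a parameterized filter over a DOM tree, with its threshold chosen by genre, can be illustrated with a minimal Python sketch. It is not Crunch's implementation: the link-density heuristic, the `GENRE_PROFILES` table, its word lists and thresholds, and all function names are hypothetical, and the genre is guessed from the page text alone rather than from search engine snippets. The sample page is well-formed markup so that the standard-library parser accepts it.

```python
import math
import re
import xml.etree.ElementTree as ET
from collections import Counter

BLOCK_TAGS = {"div", "ul", "table"}  # containers this sketch is allowed to drop

def word_freq(text):
    """Lower-cased word counts for a piece of text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-frequency distributions."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical genre profiles: (reference word distribution, filter settings).
GENRE_PROFILES = {
    "news": (word_freq("news report story editor politics world story"),
             {"max_link_ratio": 0.5}),
    "shopping": (word_freq("price cart buy shipping sale product price"),
                 {"max_link_ratio": 0.9}),
}

def detect_genre(text):
    """Pick the genre whose reference distribution is closest to the text."""
    freq = word_freq(text)
    return max(GENRE_PROFILES, key=lambda g: cosine(freq, GENRE_PROFILES[g][0]))

def text_len(node):
    return len("".join(node.itertext()).strip())

def link_text_len(node):
    return sum(text_len(a) for a in node.iter("a"))

def remove_link_heavy(root, max_link_ratio):
    """Filter: drop block containers whose text is mostly link text."""
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag not in BLOCK_TAGS:
                continue
            total = text_len(child)
            if total and link_text_len(child) / total > max_link_ratio:
                # remove() also discards the child's tail text; acceptable here.
                parent.remove(child)

def extract(page_xml):
    """Parse the page, infer its genre, and apply the filter with that genre's settings."""
    root = ET.fromstring(page_xml)
    genre = detect_genre("".join(root.itertext()))
    remove_link_heavy(root, **GENRE_PROFILES[genre][1])
    return genre, root

PAGE = """<html><body>
<div id="nav"><a href="/">Home</a> <a href="/world">World</a> <a href="/sports">Sports</a></div>
<div id="article"><p>The editor filed a news story on world politics. <a href="/more">More</a></p></div>
</body></html>"""

genre, root = extract(PAGE)
print(genre)
print(" ".join("".join(root.itertext()).split()))
```

Working on the parsed tree lets the filter reason about whole subtrees (a navigation block versus an article body) instead of matching patterns in flat markup, and keeping the threshold in a per-genre settings table is one simple way to avoid hand-tuning it for each site.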

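Bibliographic record

  • Author: Gupta, Suhit
  • Author affiliation: Columbia University
  • Degree-granting institution: Columbia University
  • Discipline: Computer Science
  • Degree: Ph.D.
  • Year: 2006
  • Pagination: 190 p.
  • Total pages: 190
  • Original format: PDF
  • Language of text: English (eng)
  • Chinese Library Classification: Automation technology, computer technology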