Automatic wrapper generation for the extraction of search result records from search engines.

Abstract

The deep web, which is estimated to be about 500 times larger than the surface web, is extremely under-utilized. Researchers have been working on various issues related to building large-scale deep web applications, which aim to unleash the real power of the deep web. One of the key issues facing large-scale deep web applications is the extraction and understanding of the data returned by deep web sites. To utilize the data in deep web sites, we need to extract the data (search result records) from the search result pages, which are web pages returned by the deep web sites that contain both the data of interest and other unrelated content. Data extraction from web pages is generally a very hard problem. The performance of existing approaches in the literature is far from satisfactory.

This dissertation studies the problem of extracting search result records from pages returned by search engines in both deep web sites and surface web sites. A method that combines the visual content features and the HTML tag structures of the result pages is proposed to generate wrappers for the extraction of search result records. This novel technique achieves significantly better performance than state-of-the-art approaches.

Extracting search result records from categorized result pages requires maintaining the section-record relationships. Major issues such as section boundaries and optional sections make good performance difficult to achieve. We introduce a novel method based on the content properties of search result records and the dynamic properties of sections.

A search result record usually consists of multiple data units. The semi-structured nature of search result records makes data unit extraction a hard problem. Mismatches between the HTML tag structures and the data structure of search result records, as well as optional and disjunctive data units, further limit performance. We introduce a novel directed acyclic graph representation of search result record templates, which can be used to extract data units from search result records. An effective algorithm based on machine learning and statistics that extracts templates from search result records is also presented.

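Bibliographic record

  • Author

    Zhao, Hongkun

  • Author affiliation

    State University of New York at Binghamton, Computer Science

  • Degree-granting institution: State University of New York at Binghamton, Computer Science
  • Subject: Computer Science
  • Degree: Ph.D.
  • Year: 2007
  • Pagination: 162 p.
  • Total pages: 162
  • Original format: PDF
  • Language of text: English
  • Chinese Library Classification: Automation technology, computer technology
  • Keywords

  • Date added to database: 2022-08-17 11:39:19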
