Automatic wrapper generation for the extraction of search result records from search engines.

Abstract

The deep web, which is estimated to be about 500 times larger than the surface web, is extremely under-utilized. Researchers have been working on various issues in building large-scale deep web applications, which aim to unleash the real power of the deep web. One of the key issues facing large-scale deep web applications is the extraction and understanding of the data returned by deep web sites. To utilize the data in deep web sites, we need to extract the data (search result records) from the search result pages, which are web pages returned by the deep web sites that contain both the data of interest and other unrelated content. Data extraction from web pages is generally a very hard problem. The performance of existing approaches in the literature is far from satisfactory.

This dissertation studies the problem of extracting search result records from pages returned by search engines in both deep web sites and surface web sites. A method that combines the visual content features and the HTML tag structures of the result pages is proposed to generate wrappers for the extraction of search result records. This novel technique achieves significantly better performance than state-of-the-art approaches.

Extracting search result records from categorized result pages requires maintaining the section-record relationships. Major issues such as section boundaries and optional sections make it difficult to achieve good performance. We introduce a novel method based on the content properties of search result records and the dynamic properties of sections.

A search result record usually consists of multiple data units. The semi-structured nature of search result records makes data unit extraction a hard problem. The mismatches between the HTML tag structures and the data structure of search result records, as well as optional and disjunctive data units, further limit performance. We introduce a novel directed acyclic graph representation of search result record templates, which can be used to extract data units from search result records. An effective algorithm based on machine learning and statistics that extracts templates from search result records is also presented.
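The idea of a directed acyclic graph template with optional and disjunctive data units can be illustrated with a minimal sketch. This is not the dissertation's algorithm: the node names (`title`, `price`, `snippet`, `url`), the matcher predicates, the hand-written `TEMPLATE`, and the `extract` function are all hypothetical, chosen only to show how a path through such a graph can label the text fragments of one record. In the dissertation the template is learned from records rather than written by hand.

```python
import re
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical template: node name -> (matcher for a text fragment, successor nodes).
# An optional unit is modeled by an edge that skips it; a disjunctive unit by a branch.
Template = Dict[str, Tuple[Callable[[str], bool], List[str]]]

TEMPLATE: Template = {
    "START":   (lambda s: False, ["title"]),
    "title":   (lambda s: True, ["price", "snippet"]),  # either unit may follow
    "price":   (lambda s: re.fullmatch(r"\$\d+(\.\d{2})?", s) is not None,
                ["snippet", "url"]),                    # snippet is optional here
    "snippet": (lambda s: len(s.split()) >= 3, ["url"]),
    "url":     (lambda s: s.startswith("http"), ["END"]),
    "END":     (lambda s: False, []),
}


def extract(fragments: List[str],
            template: Template = TEMPLATE,
            node: str = "START",
            i: int = 0) -> Optional[List[Tuple[str, str]]]:
    """Depth-first search for a START-to-END path that labels every fragment.

    Returns a list of (data unit name, fragment) pairs, or None if no path fits.
    """
    for nxt in template[node][1]:
        if nxt == "END":
            if i == len(fragments):  # all fragments consumed: path is complete
                return []
            continue
        matcher = template[nxt][0]
        if i < len(fragments) and matcher(fragments[i]):
            rest = extract(fragments, template, nxt, i + 1)
            if rest is not None:
                return [(nxt, fragments[i])] + rest
    return None


full = extract(["Acme Widget", "$19.99",
                "A sturdy widget for everyday use", "http://example.com/w"])
no_snippet = extract(["Acme Widget", "$19.99", "http://example.com/w"])
no_price = extract(["Acme Widget", "A sturdy widget for everyday use",
                    "http://example.com/w"])
```

Here the same template labels records that contain every unit, records that omit the snippet, and records that omit the price, which is the kind of variation that a fixed sequence of HTML tags would not capture.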

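Bibliographic details

Author: Zhao, Hongkun
Author affiliation: State University of New York at Binghamton, Computer Science
Degree-granting institution: State University of New York at Binghamton, Computer Science
Discipline: Computer Science
Degree: Ph.D.
Year: 2007
Pages: 162 p.
Total pages: 162
Original format: PDF
Language of text: English
Chinese Library Classification: Automation technology, computer technology
Date added to database: 2022-08-17 11:39:19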

