
Discovering web structure with multiple experts in a clustering framework.


Abstract

The World Wide Web contains vast amounts of data, but only a small portion of it is accessible to machines in an operational form. The rest of this collection sits behind a presentation layer that renders web pages in a human-friendly form but also hampers machine processing of the data. Converting web data into operational form is the task of data extraction. Current approaches to data extraction from the web either require human effort to guide supervised learning algorithms or are customized to extract a narrow range of data types in specific domains. We focus on the broader problem of discovering the underlying structure of any database-generated web site. Our approach automatically discovers the relational data hidden behind these web sites by combining experts that identify the relationship between surface structure and the underlying structure.

Our approach is to have a set of software experts that analyze a web site's pages. Each of these experts is specialized to recognize a particular type of structure. The experts discover similarities between data items within the context of the particular types of structure they analyze and output their discoveries as hypotheses in a common hypothesis language. We find the most likely clustering of the data using a probabilistic framework in which the hypotheses provide the evidence. From the clusters, the relational form of the data is derived.

We develop two frameworks following the principles of our approach. The first framework introduces a common hypothesis language in which heterogeneous experts express their discoveries. The second framework extends the common language to allow experts to assign confidence scores to their hypotheses.

We experiment in the web domain by comparing the output of our approach to data extracted by a supervised wrapper-induction system and validated manually. Our results show that our approach performs well in the data extraction task on a variety of web sites.

Our approach is applicable to other structure discovery problems as well. We demonstrate this by successfully applying our approach in the record deduplication domain.
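The abstract's idea of experts emitting confidence-scored hypotheses that feed a clustering step can be illustrated with a small sketch. Everything in it is an assumption made for illustration and is not taken from the dissertation: the `Hypothesis` record, the noisy-OR pooling in `combine`, the 0.5 threshold, and the greedy union-find linking in `cluster`, which stands in for the probabilistic search for the most likely clustering.

```python
from collections import defaultdict
from typing import NamedTuple


class Hypothesis(NamedTuple):
    """Hypothetical 'these two items belong together' claim from one expert."""
    expert: str
    a: str
    b: str
    confidence: float  # expert's belief, between 0 and 1


def combine(hypotheses):
    """Pool confidences per item pair with noisy-OR: 1 - prod(1 - c)."""
    disbelief = defaultdict(lambda: 1.0)
    for h in hypotheses:
        pair = tuple(sorted((h.a, h.b)))
        disbelief[pair] *= 1.0 - h.confidence
    return {pair: 1.0 - d for pair, d in disbelief.items()}


def cluster(items, pair_scores, threshold=0.5):
    """Greedy stand-in for the clustering step: link pairs scoring >= threshold."""
    parent = {x: x for x in items}

    def find(x):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), score in pair_scores.items():
        if score >= threshold:
            parent[find(a)] = find(b)

    groups = defaultdict(set)
    for x in items:
        groups[find(x)].add(x)
    return sorted(sorted(g) for g in groups.values())


# Made-up data items and expert outputs, for illustration only.
items = ["title1", "title2", "price1", "price2"]
hypotheses = [
    Hypothesis("url_expert", "title1", "title2", 0.4),
    Hypothesis("layout_expert", "title1", "title2", 0.4),
    Hypothesis("layout_expert", "price1", "price2", 0.7),
    Hypothesis("token_expert", "title1", "price1", 0.2),
]
scores = combine(hypotheses)
clusters = cluster(items, scores)
```

In this toy run, neither expert alone is confident enough to link the two titles, but their pooled evidence crosses the threshold, which is the kind of effect that combining heterogeneous experts is meant to produce. Each resulting cluster could then be read as one column of a relational table.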

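Bibliographic record

  • Author

    Gazen, Bora Cenk

  • Author affiliation

    Carnegie Mellon University

  • Degree-granting institution: Carnegie Mellon University
  • Subject: Artificial Intelligence; Computer Science
  • Degree: Ph.D.
  • Year: 2008
  • Pages: 120 p.
  • Original format: PDF
  • Language of text: English (eng)
  • Chinese Library Classification: artificial intelligence theory; automation technology, computer technology
  • Date added to database: 2022-08-17 11:38:40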
