
Pay-as-you-go data cleaning and integration.


Abstract

Many emerging applications, such as Web mash-ups and large-scale sensor deployments, seek to use large collections of heterogeneous data sources to enable powerful new services. These sources range from traditional ones, such as relational databases, to emerging ones, such as structured data on the Web and streaming sensor data.

To realize the potential of these applications, however, the data from these disparate sources must be cleaned and integrated. For emerging data sources such as the Web and sensors, traditional cleaning and integration techniques are necessary but not sufficient to deal with the unique challenges this data presents. I argue that new techniques based on the concept of pay-as-you-go are crucial for incorporating such data sources into applications. This concept provides a framework for building cleaning and integration solutions that are easy to deploy and maintain, efficiently leverage human feedback where possible, and automatically adapt their processing to the underlying data.

In this thesis, I contribute key building blocks designed to provide pay-as-you-go data cleaning and integration. Specifically, I develop the following techniques: Roomba, a technique for effectively involving user feedback to augment data cleaning mechanisms; Metaphysical Data Independence (MDI), a means of hiding all details of sensor data cleaning and integration under a single interface; SMURF, an adaptive cleaning tool for providing MDI for RFID data; and ESP, a declarative-query-based cleaning framework for sensor data streams. These techniques all embody key principles that underlie the pay-as-you-go philosophy: ease of setup and deployment, adaptability, and incremental integration.

Additionally, I show that a focus on the pay-as-you-go philosophy does not preclude effective data cleaning and integration mechanisms. Indeed, in many cases the techniques developed in this thesis are capable of producing higher-quality data than current cleaning and integration techniques. For instance, effective use of human feedback is able to integrate data in a large-scale data integration scenario with half the human cost of current approaches. Similarly, an adaptive approach to cleaning RFID data is able to produce a three-fold reduction in data error rate in certain scenarios compared to state-of-the-art RFID middleware solutions.

In summary, this thesis makes two broad contributions. First, it demonstrates that a pay-as-you-go approach to data cleaning and integration enables an emerging class of applications dependent on data derived from many heterogeneous data sources. Second, it proposes a suite of pay-as-you-go data cleaning and integration techniques that provide a solid foundation on which to build the systems to support these applications.

Bibliographic record

  • Author: Jeffery, Shawn R.
  • Author affiliation: University of California, Berkeley
  • Degree-granting institution: University of California, Berkeley
  • Discipline: Computer Science
  • Degree: Ph.D.
  • Year: 2008
  • Pages: 186
  • Original format: PDF
  • Language: English