Towards efficient data analysis and management of semi-structured data.

Abstract

Over the last decade, there has been an enormous growth in both the amount and the complexity of online content that is collected and processed by humans and machines. Such a growth has spurred interest in flexible and fluid (semi-structured) data models that do not constrain the data to follow a fixed schema. Many applications ranging from bioinformatics to XML repositories, from software engineering to computational linguistics, are now generating and processing large amounts of semi-structured data. For these applications to reach their full potential, we need to build an effective set of tools to index, process, manage, and analyze such data. This dissertation focuses on a specific class of semi-structured data that is denoted using hierarchical tree objects. We specifically address the following questions pertaining to mining and managing tree-structured data: How can we provide quick access mechanisms to large semi-structured data stores? How can we discover hidden structural patterns from such data collections? How can we devise strategies to realize performance that is commensurate with modern computer architectures?

In the context of managing tree-structured data, first, we develop an indexing mechanism that extracts discriminant features from the database and indexes them using a simple tunable inverted structure. Such an index is complemented with an efficient holistic query processing technique that retrieves the matches by operating entirely on a space-efficient sequential representation of trees. Second, we propose a framework that enables the development of application-specific hash functions that convert variable-sized graph and tree structured data into fixed-sized hash values. We demonstrate the usability of this framework by developing a hash-based distributed data placement service for semi-structured data. We argue that this service is capable of supporting large scale data management and data mining algorithms.

In the context of mining tree databases, first, we explore the role of succinct sequential data structures for efficiently discovering frequent tree patterns. Second, we propose a "memory-conscious design" wherein the algorithms trade memory for redundant computations to improve the memory system performance. Third, we consider the case of deploying data mining workloads on modern multicore systems. Here, we demonstrate that the bandwidth to main memory becomes a precious shared commodity as one increases the number of cores present in the system. We present mechanisms to alleviate the bandwidth pressure and show their effectiveness. Fourth, we explore an adaptive task and data parallel algorithm design that facilitates effective parallelization in the presence of data and workload skew. This algorithm is integrated into a general purpose scheduling service that supports the development of adaptive and moldable algorithms for database and mining tasks. Finally, we develop a hash-based distributed data placement service that can support the development of large scale distributed data mining and data management applications.
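The abstract does not give the dissertation's actual encodings or hash functions, so the following is only an illustrative sketch of the general idea: flatten a tree into a sequence, reduce that sequence to a fixed-size hash value, and use the value to pick a storage node. Every name here (`Node`, `to_sequence`, `tree_hash`, `place`), the `$` backtrack marker, and the choice of truncated SHA-256 are assumptions made for illustration, not the author's design.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical labeled, ordered tree node."""
    label: str
    children: list["Node"] = field(default_factory=list)


def to_sequence(root: Node) -> list[str]:
    """Depth-first label sequence; "$" marks a return to the parent.

    Assumes no label is itself "$", otherwise the encoding is ambiguous.
    """
    seq: list[str] = []

    def visit(node: Node) -> None:
        seq.append(node.label)
        for child in node.children:
            visit(child)
        seq.append("$")

    visit(root)
    return seq


def tree_hash(root: Node, bits: int = 64) -> int:
    """Map a tree of any size to a fixed-size integer.

    Assumption: SHA-256 over the joined sequence, truncated to `bits` bits.
    The unit-separator character is assumed not to occur in labels.
    """
    encoded = "\x1f".join(to_sequence(root)).encode("utf-8")
    digest = hashlib.sha256(encoded).digest()
    return int.from_bytes(digest, "big") >> (256 - bits)


def place(root: Node, num_nodes: int) -> int:
    """Pick a storage node index from the tree's hash (simple modulo placement)."""
    return tree_hash(root) % num_nodes


if __name__ == "__main__":
    tree = Node("a", [Node("b"), Node("c")])
    print(to_sequence(tree))  # ['a', 'b', '$', 'c', '$', '$']
    print(place(tree, num_nodes=4))
```

A general-purpose cryptographic hash like this only spreads trees evenly across nodes. An application-specific hash function, as the abstract describes, would presumably be designed around properties the application cares about, which this sketch does not attempt.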

Bibliographic record

  • Author: Tatikonda, Shirish
  • Author affiliation: The Ohio State University
  • Degree-granting institution: The Ohio State University
  • Subject: Computer Science
  • Degree: Ph.D.
  • Year: 2010
  • Pages: 218 p.
  • Total pages: 218
  • Original format: PDF
  • Language of text: eng
