Organizing, indexing, and searching large-scale file systems.

Abstract

The world is moving towards a digital infrastructure. This move is driving the demand for data storage and has already resulted in file systems that contain petabytes of data and billions of files. In the near future, file systems will be storing exabytes of data and trillions of files. This data growth raises the key question of how we effectively find and manage data in this growing sea of information. Unfortunately, file organization and retrieval methods have not kept pace with data volumes. Large-scale file systems continue to rely on hierarchical namespaces that make finding and managing files difficult.

As a result, there has been an increasing demand for search-based file access. A number of commercial file search solutions have become popular on desktop and small-scale enterprise systems. However, providing effective search and indexing at the scale of billions of files is not a simple task. Current solutions rely on general-purpose index designs, such as relational databases, to provide search. General-purpose indexes can be ill-suited for file system search and can limit performance and scalability. Additionally, current search solutions are designed as applications that are separate from the file system. Providing search through a separate application requires file attributes and modifications to be replicated into separate index structures, which presents consistency and efficiency problems at large scales.

This thesis addresses these problems through novel approaches to organizing, indexing, and searching files in large-scale file systems. We conduct an analysis of large-scale file system properties using workload and snapshot traces to better understand the kinds of data being stored and how it is used. This analysis represents the first major workload study since 2001 and the first major study of enterprise file system contents and workloads in over a decade. Our analysis shows that a number of important workload properties have changed since previous studies (e.g., read-to-write byte ratios have decreased to 2:1 from 4:1 or higher in past studies) and examines properties that are relevant to file organization and search. Other important observations include highly skewed workload distributions and clustering of metadata attribute values in the namespace.

We hypothesize that file search performance and scalability can be improved with file-system-specific index solutions. We present the design of new file metadata and file content indexing approaches that exploit key file system properties from our study. These designs introduce novel file-system-optimized index partitioning, query execution, and versioning techniques. We show that search performance can be improved by one to four orders of magnitude compared to traditional approaches. Additionally, we hypothesize that directly integrating search into the file system can address the consistency and efficiency problems with separate search applications. We present new metadata and semantic file system designs that introduce novel disk layout, indexing, and updating methods to enable effective search without degrading normal file system performance. We then discuss ongoing challenges and how this work may be extended in the future.

Bibliographic record

  • Author

    Leung, Andrew W.

  • Author affiliation

    University of California, Santa Cruz

  • Degree-granting institution: University of California, Santa Cruz
  • Discipline: Computer Science
  • Degree: Ph.D.
  • Year: 2009
  • Pages: 186 p.
  • Total pages: 186
  • Original format: PDF
  • Language of text: English
