Organizing, indexing, and searching large-scale file systems.


Abstract

The world is moving towards a digital infrastructure. This move is driving the demand for data storage and has already resulted in file systems that contain petabytes of data and billions of files. In the near future, file systems will store exabytes of data and trillions of files. This growth raises a key question: how do we effectively find and manage data in this growing sea of information? Unfortunately, file organization and retrieval methods have not kept pace with data volumes. Large-scale file systems continue to rely on hierarchical namespaces that make finding and managing files difficult.

As a result, there has been an increasing demand for search-based file access. A number of commercial file search solutions have become popular on desktop and small-scale enterprise systems. However, providing effective search and indexing at the scale of billions of files is not a simple task. Current solutions rely on general-purpose index designs, such as relational databases, to provide search. General-purpose indexes can be ill-suited for file system search and can limit performance and scalability. Additionally, current search solutions are designed as applications that are separate from the file system. Providing search through a separate application requires file attributes and modifications to be replicated into separate index structures, which presents consistency and efficiency problems at large scales.

This thesis addresses these problems through novel approaches to organizing, indexing, and searching files in large-scale file systems. We conduct an analysis of large-scale file system properties using workload and snapshot traces to better understand the kinds of data being stored and how they are used. This analysis represents the first major workload study since 2001 and the first major study of enterprise file system contents and workloads in over a decade. Our analysis shows that a number of important workload properties have changed since previous studies (e.g., read-to-write byte ratios have decreased to 2:1 from 4:1 or higher in past studies) and examines properties that are relevant to file organization and search. Other important observations include highly skewed workload distributions and clustering of metadata attribute values in the namespace.

We hypothesize that file search performance and scalability can be improved with file-system-specific index solutions. We present the design of new file metadata and file content indexing approaches that exploit key file system properties from our study. These designs introduce novel file-system-optimized index partitioning, query execution, and versioning techniques. We show that search performance can be improved by one to four orders of magnitude compared to traditional approaches. Additionally, we hypothesize that directly integrating search into the file system can address the consistency and efficiency problems of separate search applications. We present new metadata and semantic file system designs that introduce novel disk layout, indexing, and updating methods to enable effective search without degrading normal file system performance. We then discuss ongoing challenges and how this work may be extended in the future.


Bibliographic record

Author: Leung, Andrew W.
Author affiliation: University of California, Santa Cruz
Degree-granting institution: University of California, Santa Cruz
Discipline: Computer Science
Degree: Ph.D.
Year: 2009
Pages: 186
Original format: PDF
Language: English

