Proceedings of the IEEE

Implementing Randomized Matrix Algorithms in Parallel and Distributed Environments

Abstract

In this era of large-scale data, distributed systems built on top of clusters of commodity hardware provide cheap and reliable storage and scalable processing of massive data. With cheap storage, instead of storing only currently relevant data, it is common to store as much data as possible, hoping that its value can be extracted later. In this way, exabytes (10^18 bytes) of data are being created on a daily basis. Extracting value from these data, however, requires scalable implementations of advanced analytical algorithms beyond simple data processing, e.g., statistical regression methods, linear algebra, and optimization algorithms. Most such traditional methods are designed to minimize floating-point operations, which is the dominant cost of in-memory computation on a single machine. In parallel and distributed environments, however, load balancing and communication, including disk and network input/output (I/O), can easily dominate computation. These factors greatly increase the complexity of algorithm design and challenge traditional ways of thinking about the design of parallel and distributed algorithms. Here, we review recent work on developing and implementing randomized matrix algorithms in large-scale parallel and distributed environments. Randomized algorithms for matrix problems have received a great deal of attention in recent years, thus far typically either in theory or in machine learning applications or with implementations on a single machine. Our main focus is on the underlying theory and practical implementation of random projection and random sampling algorithms for very large, very overdetermined (i.e., overconstrained) ℓ₁- and ℓ₂-regression problems.
Randomization can be used in one of two related ways: either to construct subsampled problems that can be solved, exactly or approximately, with traditional numerical methods; or to construct preconditioned versions of the original full problem that are easier to solve with traditional iterative algorithms. Theoretical results demonstrate that in near input-sparsity time and with only a few passes through the data one can obtain very strong relative-error approximate solutions, with high probability. Empirical results highlight the importance of various tradeoffs (e.g., between the time to construct an embedding and the conditioning quality of the embedding, between the relative importance of computation versus communication, etc.) and demonstrate that ℓ₁- and ℓ₂-regression problems can be solved to low, medium, or high precision in existing distributed systems on up to terabyte-sized data.
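The first of the two uses of randomization described below, solving a subsampled problem directly ("sketch-and-solve"), can be illustrated for ℓ₂-regression in a few lines. This is a minimal NumPy sketch with a dense Gaussian embedding; the problem dimensions and sketch size `s` are illustrative choices, not values from the paper, and practical implementations use faster embeddings (e.g., sparse or subsampled-transform sketches).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 20                      # very overdetermined: n >> d
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.01 * rng.standard_normal(n)

# Dense Gaussian random projection: embed the n rows into s << n rows.
s = 500
S = rng.standard_normal((s, n)) / np.sqrt(s)

# Solve the small subsampled problem instead of the full one.
x_sketch, *_ = np.linalg.lstsq(S @ A, S @ b, rcond=None)
x_exact, *_ = np.linalg.lstsq(A, b, rcond=None)

# With high probability, the sketched solution's residual is a
# (1 + eps) relative-error approximation of the optimal residual.
r_sketch = np.linalg.norm(A @ x_sketch - b)
r_exact = np.linalg.norm(A @ x_exact - b)
rel_err = r_sketch / r_exact - 1.0
print(f"relative residual error: {rel_err:.4f}")
```

The tradeoff mentioned in the abstract is visible here: a larger `s` gives a better embedding (smaller `rel_err`) but costs more to construct and solve.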
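The second use, sketching to build a preconditioner for a traditional iterative solver, can be illustrated the same way. This is again an assumed NumPy sketch: QR-factor the small sketched matrix and use the resulting R as a right preconditioner; a real implementation would pass R into an iterative solver such as LSQR rather than forming an explicit inverse, and the dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 10_000, 20
# A deliberately ill-conditioned tall matrix: orthonormal columns
# scaled so the singular values span 1 .. 1e6.
Q, _ = np.linalg.qr(rng.standard_normal((n, d)))
A = Q * np.logspace(0, 6, d)

# Sketch A, then take a QR factorization of the small sketch S @ A.
s = 200
S = rng.standard_normal((s, n)) / np.sqrt(s)
_, R = np.linalg.qr(S @ A)

# A @ inv(R) is well conditioned with high probability, so iterative
# solvers on the preconditioned system converge in few iterations.
cond_before = np.linalg.cond(A)
cond_after = np.linalg.cond(A @ np.linalg.inv(R))
print(f"cond(A) = {cond_before:.1e}, cond(A R^-1) = {cond_after:.2f}")
```

Because the iteration count of solvers like LSQR grows with the condition number, reducing it from ~10^6 to O(1) is what makes the "precondition, then iterate" route practical on the full problem.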
