Fractal Dimension Calculation for Big Data Using Box Locality Index



Abstract

The box-counting approach to fractal dimension calculation is scaled up for big data using a data structure named the box locality index (BLI). The BLI is constructed as key-value pairs, with the key indexing the location of a "box" (i.e., a grid cell in the multi-dimensional space) and the value counting the number of data points inside the box (i.e., the "box occupancy"). This key-value structure significantly simplifies the hierarchical structure traditionally used and encodes only the information that the box-counting approach requires. Moreover, because the box occupancies (i.e., the values) associated with the same index (i.e., the key) can be aggregated, the BLI gives the box-counting approach the scalability needed to calculate the fractal dimension of big data with distributed computing techniques (e.g., MapReduce and Spark). Taking advantage of the BLI, MapReduce and Spark methods for fractal dimension calculation of big data are developed; they conduct box-counting for each grid level as a cascade of MapReduce/Spark jobs in a bottom-up fashion. In an empirical validation, the MapReduce and Spark methods demonstrated good effectiveness and efficiency in calculating the fractal dimension of a large synthetic dataset. In summary, this work provides an efficient solution for estimating the intrinsic dimension of big data, which is essential for many machine learning methods and data analytics tasks, including feature selection and dimensionality reduction.
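The key-value idea described in the abstract can be illustrated with a small single-machine sketch. This is not the authors' MapReduce or Spark implementation; it only mirrors the map step (point to box key) and the reduce step (sum occupancies per key) with a plain `Counter`. The function names, the assumption that points lie in the unit hypercube, the power-of-two grids, and the choice to estimate the box-counting dimension from occupied-box counts are all assumptions made for illustration.

```python
from collections import Counter
import math


def build_bli(points, level):
    """Finest-level BLI: key = integer grid-cell coordinates, value = occupancy.

    Assumption: points lie in [0, 1)^d and the grid has 2**level cells per axis.
    """
    cells = 2 ** level
    bli = Counter()
    for p in points:
        # "Map" step: turn a point into the key of the box that contains it.
        key = tuple(min(int(x * cells), cells - 1) for x in p)
        # "Reduce" step: occupancies with the same key are summed.
        bli[key] += 1
    return bli


def coarsen(bli):
    """Build the next coarser level by halving each key and summing occupancies."""
    parent = Counter()
    for key, occupancy in bli.items():
        parent[tuple(k // 2 for k in key)] += occupancy
    return parent


def box_counts(points, finest_level):
    """Return {level: number of occupied boxes}, computed bottom-up."""
    bli = build_bli(points, finest_level)
    counts = {finest_level: len(bli)}
    for level in range(finest_level - 1, 0, -1):
        bli = coarsen(bli)  # each level is derived from the level below it
        counts[level] = len(bli)
    return counts


def estimate_dimension(counts):
    """Least-squares slope of log2(occupied boxes) against grid level.

    With box side r = 2**-level, N(r) ~ r**-D implies log2 N = D * level + c.
    """
    xs = sorted(counts)
    ys = [math.log2(counts[x]) for x in xs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


if __name__ == "__main__":
    # Points on the diagonal of the unit square form a one-dimensional set.
    diagonal = [(i / 4096, i / 4096) for i in range(4096)]
    print(estimate_dimension(box_counts(diagonal, finest_level=6)))
```

Because each coarser level is computed only from the key-value pairs of the level below it, the raw points are read once. In a distributed setting the two `Counter` updates would correspond to a sum-by-key aggregation.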

Bibliographic details

  • Source
    Annals Data Science | 2018, Issue 4 | pp. 549-563 | 15 pages
  • Author affiliations

    Institute of the Environment and Sustainability, University of California;

    Advanced Computing, Mathematics, and Data Division, Pacific Northwest National Laboratory;

    Institute of the Environment and Sustainability, University of California; Chemical and Biomolecular Engineering Department, University of California; Center for Environmental Implications of Nanotechnology, University of California;

  • Indexing information
  • Original format: PDF
  • Language: English
  • Chinese Library Classification
  • Keywords

    Fractal dimension; Intrinsic dimension; Box-counting; Box locality index; MapReduce; Spark;

