Journal of Physics: Conference Series

Toward real-time data query systems in HEP


Abstract

Exploratory data analysis tools must respond quickly to a user's questions, so that the answer to one question (e.g. a visualized histogram or fit) can influence the next. In some SQL-based query systems used in industry, even very large (petabyte) datasets can be summarized on a human timescale (seconds), employing techniques such as columnar data representation, caching, indexing, and code generation/JIT-compilation. This article describes progress toward realizing such a system for High Energy Physics (HEP), focusing on the intermediate problems of optimizing data access and calculations for "query sized" payloads, such as a single histogram or group of histograms, rather than large reconstruction or data-skimming jobs. These techniques include direct extraction of ROOT TBranches into Numpy arrays and compilation of Python analysis functions (rather than SQL) to be executed very quickly. We will also discuss the problem of caching and actively delivering jobs to worker nodes that have the necessary input data preloaded in cache. All of these pieces of the larger solution are available as standalone GitHub repositories, and could be used in current analyses.
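
As a rough illustration of what a "query sized" payload over columnar data might look like, the sketch below fills a single histogram from one column and never touches the others. It is not the authors' code: the branch names, the values, and the helper functions `histogram` and `query` are all hypothetical, and the standard-library `array` type stands in for the Numpy arrays that the abstract describes extracting from ROOT TBranches, purely to keep the example dependency-free.

```python
from array import array

# Hypothetical event data stored column-wise: one contiguous array per
# attribute, standing in for arrays extracted from ROOT TBranches.
columns = {
    "muon_pt": array("d", [12.5, 48.0, 33.1, 7.9, 61.2, 25.4]),
    "muon_eta": array("d", [0.3, -1.2, 2.1, 0.0, -0.7, 1.5]),
}


def histogram(values, nbins, low, high):
    """Count values into nbins equal-width bins covering [low, high)."""
    counts = [0] * nbins
    width = (high - low) / nbins
    for v in values:
        if low <= v < high:
            # Clamp to guard against floating-point rounding at the top edge.
            counts[min(int((v - low) / width), nbins - 1)] += 1
    return counts


def query(columns, branch, nbins, low, high):
    """Answer a single-histogram query by reading only the requested column."""
    return histogram(columns[branch], nbins, low, high)


pt_hist = query(columns, "muon_pt", nbins=4, low=0.0, high=80.0)
print(pt_hist)  # [2, 2, 1, 1]
```

Because each attribute lives in its own array, the query reads `muon_pt` alone; `muon_eta` stays untouched, which is the property that makes columnar layouts attractive for small, interactive queries.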
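
The idea of delivering jobs to worker nodes that already hold the necessary input data in cache can be sketched in the same spirit. This is one possible policy under stated assumptions, not the scheduling scheme of the article: the node names, the `cache` and `load` fields, and the functions `pick_worker` and `dispatch` are invented for illustration. A job goes to the worker whose cache overlaps most with the columns it needs, and ties go to the less loaded worker.

```python
# Hypothetical cluster state: which columns each worker has cached and how
# many jobs it is currently running.
workers = {
    "node-a": {"cache": {"muon_pt", "muon_eta"}, "load": 2},
    "node-b": {"cache": {"jet_pt"}, "load": 0},
    "node-c": {"cache": {"muon_pt"}, "load": 0},
}


def pick_worker(workers, needed):
    """Prefer the largest cache overlap with `needed`, then the lowest load."""
    def score(name):
        overlap = len(needed & workers[name]["cache"])
        return (overlap, -workers[name]["load"])
    # Iterate in sorted order so that remaining ties resolve deterministically.
    return max(sorted(workers), key=score)


def dispatch(workers, needed):
    """Assign a job, returning the chosen worker and the columns it must load."""
    name = pick_worker(workers, needed)
    missing = needed - workers[name]["cache"]
    workers[name]["cache"] |= needed  # the worker caches what it had to load
    workers[name]["load"] += 1
    return name, missing


first = dispatch(workers, {"muon_pt", "muon_eta"})
second = dispatch(workers, {"jet_pt", "muon_pt"})
print(first)   # ('node-a', set())
print(second)  # ('node-b', {'muon_pt'})
```

The first job lands on `node-a` despite its higher load, because it needs no new data there. The second has equal overlap on every node, so the tie-break sends it to an idle worker, which then loads the one missing column.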
