
Mynodbcsv: Lightweight Zero-Config Database Solution for Handling Very Large CSV Files


Abstract

Volumes of data used in science and industry are growing rapidly. When researchers face the challenge of analyzing them, the data format is often the first obstacle. Because there are no standardized ways of exploring different data layouts, the problem has to be solved from scratch each time. The ability to access data in a rich, uniform manner, e.g. using Structured Query Language (SQL), would offer expressiveness and user-friendliness. Comma-separated values (CSV) is one of the most common data storage formats. Despite its simplicity, handling it becomes non-trivial as file size grows. Importing CSV files into existing databases is time-consuming and troublesome, or even impossible if the file's horizontal dimension reaches thousands of columns. Most databases are optimized for handling a large number of rows rather than columns; therefore, performance for datasets with non-typical layouts is often unacceptable. Other challenges include schema creation, updates and repeated data imports. To address these problems, I present a system for accessing very large CSV-based datasets by means of SQL. It is characterized by: a “no copy” approach, in which data stay mostly in the CSV files; “zero configuration”, with no need to specify a database schema; an implementation in C++ with boost, SQLite and Qt that requires no installation and is very small; query rewriting, dynamic creation of indices for appropriate columns and static data retrieval directly from CSV files, which together ensure efficient plan execution; effortless support for millions of columns; per-value typing, which makes mixed text/numbers data easy to use; and a very simple network protocol that provides an efficient interface for MATLAB and reduces implementation time for other languages. The software is available as freeware, along with educational videos, on its website. It needs no prerequisites to run, as all of the libraries are included in the distribution package. I test it against existing database solutions using a battery of benchmarks and discuss the results.
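
As a rough illustration of two of the ideas named in the abstract, “zero configuration” and per-value typing, the following Python sketch loads a small CSV into SQLite without declaring any column types, so each cell keeps its own type. It is not Mynodbcsv's implementation: the helper names `coerce` and `load_csv`, the table name and the sample data are all hypothetical, and unlike the “no copy” approach described above, this sketch copies every value into an in-memory database.

```python
import csv
import io
import sqlite3


def coerce(value):
    """Per-value typing (assumed rule): try int, then float, else keep the text."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass
    return value


def load_csv(conn, table, text):
    """Create a table from the CSV header alone; no column types are declared."""
    rows = csv.reader(io.StringIO(text))
    header = next(rows)
    cols = ", ".join('"{}"'.format(name.replace('"', '""')) for name in header)
    marks = ", ".join("?" for _ in header)
    conn.execute('CREATE TABLE "{}" ({})'.format(table, cols))
    conn.executemany(
        'INSERT INTO "{}" VALUES ({})'.format(table, marks),
        ([coerce(v) for v in row] for row in rows),
    )


# Hypothetical sample data: the "score" column mixes numbers and text.
SAMPLE = "id,score,label\n1,3.5,alpha\n2,n/a,beta\n3,7,gamma\n"

conn = sqlite3.connect(":memory:")
load_csv(conn, "samples", SAMPLE)

# typeof() reports the storage class of each individual value.
result = conn.execute(
    "SELECT id, score, typeof(score) FROM samples ORDER BY id"
).fetchall()
print(result)
```

Because the columns are created without a declared type, SQLite stores each value as it was bound, so a single column can hold integers, reals and text side by side. The sketch does not attempt the other features listed in the abstract, such as query rewriting, dynamic index creation or very wide tables.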

Bibliographic details

  • Author: Stanisław Adaszewski
  • Volume (issue): 9(7)
  • Pages: e103319
  • Total pages: 8
  • Original format: PDF
