Journal: Frontiers of Information Technology & Electronic Engineering

TextGen: a realistic text data content generation method for modern storage system benchmarks


Abstract

Modern storage systems incorporate data compressors to improve their performance and capacity. As a result, data content can significantly influence the result of a storage system benchmark. Because real-world proprietary datasets are too large to be copied onto a test storage system, and most data cannot be shared due to privacy issues, a benchmark needs to generate data synthetically. To ensure that the result is accurate, the generated content must reflect the real-world data properties that influence storage system performance during the execution of a benchmark. The existing approach, SDGen, cannot guarantee accurate benchmark results on storage systems that have built-in word-based compressors, because it characterizes the properties that influence compression performance only at the byte level and none at the word level. To address this problem, we present TextGen, a realistic text data content generation method for modern storage system benchmarks. TextGen builds a word corpus by segmenting real-world text datasets and creates a word-frequency distribution by counting the occurrences of each word in the corpus. To improve data generation performance, the word-frequency distribution is fitted to a lognormal distribution by maximum likelihood estimation, and a Monte Carlo approach is used to generate synthetic data. The running time of TextGen depends only on the expected data size n, which means that its time complexity is O(n). To evaluate TextGen, experiments were performed on four real-world datasets. The experimental results show that, compared with SDGen, the compression performance and compression ratio of the datasets generated by TextGen deviate less from those of the real-world datasets when end-tagged dense code, a representative word-based compressor, is evaluated.
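The pipeline described in the abstract (segment, count, fit a lognormal by maximum likelihood, then sample) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the regex-based segmentation, and in particular the way the fitted lognormal is turned into per-word sampling weights are assumptions made for illustration only.

```python
import math
import random
import re
from collections import Counter
from itertools import accumulate


def word_counts(text):
    """Segment text into words (assumed: simple regex split) and count each word."""
    return Counter(re.findall(r"\w+", text.lower()))


def fit_lognormal(counts):
    """Closed-form MLE of (mu, sigma) for a lognormal fitted to the word counts."""
    logs = [math.log(c) for c in counts.values()]
    mu = sum(logs) / len(logs)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in logs) / len(logs))
    return mu, sigma


def generate(counts, mu, sigma, target_bytes, seed=0):
    """Monte Carlo generation: draw words until roughly target_bytes are produced."""
    rng = random.Random(seed)
    words = [w for w, _ in counts.most_common()]
    # Assumption: one lognormal weight is drawn per vocabulary word, and the
    # largest weights are assigned to the most frequent words.
    weights = sorted((rng.lognormvariate(mu, sigma) for _ in words), reverse=True)
    cum = list(accumulate(weights))
    out, size = [], 0
    while size < target_bytes:
        w = rng.choices(words, cum_weights=cum, k=1)[0]
        out.append(w)
        size += len(w.encode("utf-8")) + 1  # +1 for the separating space
    return " ".join(out)


if __name__ == "__main__":
    sample = "the cat saw the dog and the dog saw the cat"
    counts = word_counts(sample)
    mu, sigma = fit_lognormal(counts)
    print(generate(counts, mu, sigma, target_bytes=80))
```

The number of draws in the generation loop grows with the requested output size rather than with the size of the source corpus, which matches the abstract's claim that running time depends only on the expected data size. Each draw here uses a binary search over the cumulative weights, so the sketch is linear in output size only for a fixed vocabulary.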
