TextGen: a realistic text data content generation method for modern storage system benchmarks

Long-xiang Wang; Xiao-she Dong; Xing-jun Zhang; Yin-feng Wang; Tao Ju; Guo-fu Feng


Abstract

Modern storage systems incorporate data compressors to improve their performance and capacity. As a result, data content can significantly influence the result of a storage system benchmark. Because real-world proprietary datasets are too large to be copied onto a test storage system, and most data cannot be shared due to privacy issues, a benchmark needs to generate data synthetically. To ensure that the result is accurate, it is necessary to generate data content based on the characterization of real-world data properties that influence the storage system performance during the execution of a benchmark. The existing approach, called SDGen, cannot guarantee that the benchmark result is accurate in storage systems that have built-in word-based compressors. The reason is that SDGen characterizes the properties that influence compression performance only at the byte level, and no properties are characterized at the word level. To address this problem, we present TextGen, a realistic text data content generation method for modern storage system benchmarks. TextGen builds the word corpus by segmenting real-world text datasets, and creates a word-frequency distribution by counting each word in the corpus. To improve data generation performance, the word-frequency distribution is fitted to a lognormal distribution by maximum likelihood estimation. The Monte Carlo approach is used to generate synthetic data. The running time of TextGen generation depends only on the expected data size, which means that the time complexity of TextGen is O(n). To evaluate TextGen, four real-world datasets were used to perform an experiment. The experimental results show that, compared with SDGen, the compression performance and compression ratio of the datasets generated by TextGen deviate less from real-world datasets when end-tagged dense code, a representative of word-based compressors, is evaluated.
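The pipeline described in the abstract (count words, fit a lognormal by maximum likelihood, then sample) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `word_counts`, `fit_lognormal`, and `generate` are hypothetical, the regular-expression tokenizer stands in for whatever segmentation the paper uses, and giving each vocabulary word a weight drawn from the fitted lognormal is an assumption about how the fitted distribution drives the Monte Carlo step.

```python
import math
import random
import re
from collections import Counter
from itertools import accumulate


def word_counts(text):
    """Segment text into words (assumed tokenizer) and count each word."""
    return Counter(re.findall(r"[A-Za-z0-9']+", text.lower()))


def fit_lognormal(counts):
    """Closed-form MLE for a lognormal: mean and std of the log-values."""
    logs = [math.log(c) for c in counts]
    mu = sum(logs) / len(logs)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in logs) / len(logs))
    return mu, sigma


def generate(vocab, mu, sigma, target_bytes, seed=0):
    """Monte Carlo generation: draw words until target_bytes is reached."""
    rng = random.Random(seed)
    # Assumption: each word's sampling weight is drawn from the fitted lognormal.
    weights = [rng.lognormvariate(mu, sigma) for _ in vocab]
    cum = list(accumulate(weights))
    out, size = [], 0
    while size < target_bytes:
        word = rng.choices(vocab, cum_weights=cum, k=1)[0]
        out.append(word)
        size += len(word) + 1  # +1 accounts for the separating space
    return " ".join(out)
```

In this sketch the fitting cost is paid once, and the generation loop runs a number of iterations proportional to the requested size, which matches the abstract's point that generation time depends only on the expected data size. A typical call would pass `list(word_counts(corpus))` as `vocab` and the fitted `mu` and `sigma` to `generate`.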


Bibliographic details

Source: Frontiers of Information Technology & Electronic Engineering, 2016, No. 10, 12 pages
Authors: Long-xiang Wang; Xiao-she Dong; Xing-jun Zhang; Yin-feng Wang; Tao Ju; Guo-fu Feng
Original format: PDF
Chinese Library Classification: TP-570
