Evaluating text preprocessing to improve compression on maillogs

Abstract

Maillogs record important information about mail that has been sent or received. This information can be used for statistical purposes and to help prevent viruses and spam. To satisfy regulations and follow good security practices, maillogs need to be monitored and archived. Because the quantity of data is large, some form of data reduction is necessary, and compression programs such as gzip and bzip2 are commonly used for this. Text preprocessing is known to aid the compression of English text files. This paper evaluates whether text preprocessing, particularly word replacement, can also improve the compression of maillogs. It presents an algorithm for constructing a dictionary for word replacement and reports the results of experiments conducted with the ppmd, gzip, bzip2 and 7zip programs. These tests show that text preprocessing improves data compression on maillogs: improvements of up to 56 percent in compression time and up to 32 percent in compression ratio are achieved. The results also show that a dictionary generated from one maillog can be used on other maillogs to yield reductions within half a percent of the results achieved for the maillog used to generate the dictionary.
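The abstract does not spell out the dictionary construction algorithm, so the following is only a minimal sketch of the general idea of word replacement, not the paper's method. It assumes a "word" is a run of at least four characters starting with a letter, ranks words by the approximate number of bytes a two-byte code would save, and assumes the byte 0x01 never occurs in the log. All names (`ESC`, `TOKEN`, `build_dictionary`, `encode`, `decode`, `compressed_sizes`) are hypothetical. Python's `zlib`, `bz2` and `lzma` modules stand in for gzip, bzip2 and 7zip; ppmd has no standard-library equivalent and is omitted.

```python
import bz2
import lzma
import re
import zlib
from collections import Counter

ESC = "\x01"  # escape character; assumed never to appear in the log text
TOKEN = re.compile(r"[A-Za-z][A-Za-z0-9_.-]{3,}")  # assumed definition of a "word"
MAX_CODES = 94  # one printable ASCII character (0x21..0x7E) per dictionary entry


def build_dictionary(text, max_words=MAX_CODES):
    """Return the most profitable words, ordered by estimated bytes saved."""
    counts = Counter(TOKEN.findall(text))
    # Keep only words that repeat; each replacement costs 2 characters.
    candidates = [w for w, n in counts.items() if n >= 2]
    ranked = sorted(candidates, key=lambda w: (len(w) - 2) * counts[w], reverse=True)
    return ranked[:min(max_words, MAX_CODES)]


def encode(text, words):
    """Replace each dictionary word with ESC plus a one-character code."""
    if ESC in text:
        raise ValueError("input already contains the escape character")
    codes = {w: ESC + chr(0x21 + i) for i, w in enumerate(words)}
    return TOKEN.sub(lambda m: codes.get(m.group(0), m.group(0)), text)


def decode(encoded, words):
    """Invert encode(): expand every ESC-prefixed code back to its word."""
    pattern = re.escape(ESC) + r"(.)"
    return re.sub(pattern, lambda m: words[ord(m.group(1)) - 0x21], encoded)


def compressed_sizes(text):
    """Compressed size in bytes under three general-purpose compressors."""
    data = text.encode("utf-8")
    return {
        "zlib": len(zlib.compress(data, 9)),
        "bz2": len(bz2.compress(data, 9)),
        "lzma": len(lzma.compress(data)),
    }
```

To try it on a real log, build the dictionary from one maillog, call `compressed_sizes` on both the raw text and the output of `encode`, and compare. Whether preprocessing helps, and by how much, depends on the log and the compressor, so this sketch makes no claim to reproduce the figures reported above. The dictionary itself must be stored alongside the archive for `decode` to work.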
