
Hierarchical Clustering Approach to Text Compression


Abstract

This paper explores a novel perspective on data compression, focusing on a new text compression algorithm based on clustering, a technique from data mining. Huffman encoding is enhanced through clustering, a non-trivial phase in data mining, to achieve lossless text compression. The seminal hierarchical clustering technique is modified so that an optimal number of words (patterns, each a sequence of characters with a space as suffix) is obtained. These patterns are used in the encoding process of our algorithm in place of the single-character code assignment of conventional Huffman encoding. Our approach is built on an efficient cosine similarity measure, which maximizes the compression ratio. Simulation of the proposed technique on a benchmark corpus clearly shows gains in compression ratio and time relative to conventional Huffman encoding.
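The core contrast in the abstract, assigning Huffman codes to space-terminated word patterns rather than to single characters, can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the modified hierarchical clustering step and the cosine similarity measure are not specified in the abstract and are omitted here. The function names (`huffman_code`, `tokenize_words`, `encode`, `decode`) are hypothetical, chosen only for illustration.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_code(symbols):
    """Build a Huffman code {symbol: bitstring} from a sequence of symbols."""
    freq = Counter(symbols)
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct symbol still needs a one-bit code.
        return {next(iter(freq)): "0"}
    tie = count()  # tie-breaker so the heap never compares dicts
    heap = [(f, next(tie), {s: ""}) for s, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        # Prefix the two least frequent subtrees with 0 and 1, then merge.
        merged = {s: "0" + bits for s, bits in left.items()}
        merged.update({s: "1" + bits for s, bits in right.items()})
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]


def tokenize_words(text):
    """Split text into patterns: runs of characters ending with a space.

    Assumption: a trailing run without a space is kept as its own pattern.
    """
    tokens, current = [], ""
    for ch in text:
        current += ch
        if ch == " ":
            tokens.append(current)
            current = ""
    if current:
        tokens.append(current)
    return tokens


def encode(symbols, code):
    """Concatenate the codeword of every symbol."""
    return "".join(code[s] for s in symbols)


def decode(bits, code):
    """Invert a prefix code by scanning bits until a codeword matches."""
    inverse = {bits_: sym for sym, bits_ in code.items()}
    out, buf = [], ""
    for bit in bits:
        buf += bit
        if buf in inverse:
            out.append(inverse[buf])
            buf = ""
    return "".join(out)
```

Calling `huffman_code(tokenize_words(text))` gives a word-level code, while `huffman_code(text)` gives the conventional character-level code, so the two encoded lengths can be compared directly. The comparison covers only the encoded payload. A word-level scheme must also store a larger code table, which this sketch does not account for.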
