International Conference on Bangla Speech and Language Processing

An Efficient Technique for Representation and Compression of Bengali Text



Abstract

Representing and compressing natural-language text has become a challenging research topic in recent times. The problem is especially significant for Bengali, whose script has a more complicated structure. Some work has been done on Bengali text compression, but existing methods may not compress well when the text contains a large number of conjugate (conjunct) characters. In this paper, we present an efficient technique for representing and compressing Bengali documents that achieves better compression gain in a computationally inexpensive manner. In the proposed approach, each single Bengali character is represented by a unique 2-digit decimal value, whereas each conjugate character is represented by a unique 4-digit decimal value. The decimal value of a word is formed from the decimal values of its constituent characters. After all word values are indexed and sorted, successive subtraction is applied to the sorted values to reduce the magnitude of the numbers. The resulting decimal values can then be encoded with relatively few bits for efficient storage or transmission. Experimental results on 5 different Bengali datasets show that the proposed technique improves the average compression ratio over existing compression schemes: WinZip (30.74%), WinRAR (19.75%) and 7-Zip (16.56%).
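
The steps described in the abstract (character codes, word values, sorting, successive subtraction) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the code tables `SINGLE_CODES` and `CONJUNCT_CODES`, the helper names, and the choice to build a word value by concatenating digit strings are all assumptions made for illustration, since the abstract does not give the actual code assignments or the final bit-level encoding.

```python
# Hypothetical code tables. The paper's real 2-digit and 4-digit
# assignments are not given in the abstract; these are placeholders.
SINGLE_CODES = {"ক": "11", "ত": "12", "ব": "13", "ল": "14", "ম": "15", "শ": "16"}
CONJUNCT_CODES = {"ক্ত": "2001", "ম্ব": "2002"}


def word_value(word: str) -> int:
    """Build a word's decimal value from its character codes.

    Assumption: codes are concatenated as digit strings, and a
    conjunct is matched before falling back to a single character.
    """
    digits = []
    i = 0
    while i < len(word):
        for conjunct, code in CONJUNCT_CODES.items():
            if word.startswith(conjunct, i):
                digits.append(code)
                i += len(conjunct)
                break
        else:
            # No conjunct starts here, so encode one single character.
            digits.append(SINGLE_CODES[word[i]])
            i += 1
    return int("".join(digits))


def delta_encode(values):
    """Sort the distinct word values and apply successive subtraction.

    Returns the sorted values and their differences; the first
    difference is the smallest value itself.
    """
    ordered = sorted(set(values))
    if not ordered:
        return [], []
    deltas = [ordered[0]] + [b - a for a, b in zip(ordered, ordered[1:])]
    return ordered, deltas


def delta_decode(deltas):
    """Recover the sorted word values by running summation."""
    values, total = [], 0
    for d in deltas:
        total += d
        values.append(total)
    return values


words = ["কল", "বল", "কম", "শক্ত"]
ordered, deltas = delta_encode(word_value(w) for w in words)
print(ordered)
print(deltas)
```

Because neighbouring values in the sorted list tend to be close, most differences are much smaller than the original word values, which is what lets them be stored in fewer bits. A full scheme would also need the index that maps each word position in the text back to its entry in the sorted list; that part is omitted here.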
