Conference: International Conference on Information and Communication Technology

Development of word-based text compression algorithm for Indonesian language document

Abstract

Information technology is growing very rapidly, particularly in data handling. Data is a valuable asset for everyone, especially for larger companies with branches in several places. Transmitting data from headquarters to branch offices requires the company to provide good tools for the task. These companies also need tools that can compress data to reduce its size. The main idea of the word-based encoding is to extract each word of the source text and check whether it contains capital letters. After that, the word is checked for symbols or numbers. Particles are separated from the basic word using a stemming algorithm. Symbols, numbers and affixes are indexed in the basic dictionary. The basic word is then looked up in the basic dictionary; if there is no match, the word is stored in the supplement dictionary. The experiment was conducted on text files ranging in size from about 10K bytes to 500K bytes, using 16-bit codewords. The results show that the compression ratio of the proposed method is comparable with that of previous methods, while its processing time is much better than that of the Reversed Sequence of Characters on LZW method.
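The encoding steps described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give the dictionaries, the stemming algorithm, or the codeword layout, so everything here is an assumption made for illustration. The particle list, the tiny `BASIC_WORDS` list, the `<CAP>` marker, and the helper names (`build_basic_dictionary`, `split_particle`, `encode`, `decode`, `pack16`) are all hypothetical. The stemmer is replaced by a naive suffix check, and capitalization is simplified to a single flag for a leading capital letter, so all-caps words would not survive a round trip.

```python
import re
import struct

# Hypothetical toy data; the paper's real dictionaries are not given.
PARTICLES = ("lah", "kah", "pun", "nya")
BASIC_WORDS = ["buku", "baca", "saya", "itu"]
SYMBOLS = [" ", ".", ",", "?", "!"]
CAP_FLAG = "<CAP>"  # assumed marker emitted before a capitalized word


def build_basic_dictionary():
    """Index the flag, symbols, digits, particles and basic words."""
    entries = ([CAP_FLAG] + SYMBOLS + [str(d) for d in range(10)]
               + ["-" + p for p in PARTICLES] + BASIC_WORDS)
    return {entry: index for index, entry in enumerate(entries)}


def split_particle(word, basic):
    """Naive stand-in for stemming: strip one trailing particle
    only when the remainder is a known basic word."""
    for p in PARTICLES:
        if word.endswith(p) and word[:-len(p)] in basic:
            return word[:-len(p)], "-" + p
    return word, None


def encode(text, basic, supplement):
    """Turn text into a list of integer codes, growing `supplement`
    with every entry that is missing from the basic dictionary."""
    def lookup(entry):
        if entry in basic:
            return basic[entry]
        # Supplement entries are numbered after the basic entries.
        return supplement.setdefault(entry, len(basic) + len(supplement))

    codes = []
    # Tokens: runs of ASCII letters, single digits, or any other single character.
    for token in re.findall(r"[A-Za-z]+|\d|.", text, flags=re.DOTALL):
        if token.isalpha():
            if token[0].isupper():
                codes.append(basic[CAP_FLAG])
            root, particle = split_particle(token.lower(), basic)
            codes.append(lookup(root))
            if particle:
                codes.append(basic[particle])
        else:
            codes.append(lookup(token))
    return codes


def decode(codes, basic, supplement):
    """Invert `encode` using both dictionaries."""
    table = {index: entry for entry, index in {**basic, **supplement}.items()}
    out, capitalize = [], False
    for code in codes:
        entry = table[code]
        if entry == CAP_FLAG:
            capitalize = True
        elif len(entry) > 1 and entry.startswith("-"):
            out[-1] += entry[1:]  # re-attach the particle to the previous word
        else:
            out.append(entry.capitalize() if capitalize else entry)
            capitalize = False
    return "".join(out)


def pack16(codes):
    """Serialize codes as big-endian 16-bit codewords."""
    if any(code >= 1 << 16 for code in codes):
        raise ValueError("code does not fit in a 16-bit codeword")
    return struct.pack(f">{len(codes)}H", *codes)
```

With `basic = build_basic_dictionary()` and an empty `supplement = {}`, calling `encode("Saya baca buku itu. Bukunya bagus!", basic, supplement)` places the unknown word `bagus` in the supplement dictionary, and `decode` restores the original sentence. Each code occupies two bytes after `pack16`, so the saving comes from replacing whole words with a single codeword; a real implementation would also need to store or transmit the supplement dictionary alongside the codes.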