Development of word-based text compression algorithm for Indonesian language document


Abstract

Information technology is growing very rapidly, in particular for data handling. Data is a valuable asset for everyone, especially for larger companies with branches in several places. Transmitting data from headquarters to branch offices requires the company to provide suitable tools, including tools that compress data to reduce its size. The main idea of word-based encoding is to extract each word of the source text and check whether it contains capital letters. It is then checked for symbols or numbers. Particles are separated from the base word using a stemming algorithm. Symbols, numbers and affixes are indexed in the basic dictionary. The base word is also looked up in the basic dictionary; if there is no match, the word is stored in the supplement dictionary. The experiment was conducted on text files ranging in size from about 10 KB up to 500 KB, using 16-bit codewords. The results show that the compression ratio of the proposed method is comparable to that of previous methods, while its processing time is much better than that of the Reversed Sequence of Characters on LZW method.
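
The pipeline described in the abstract can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration, not the authors' implementation: the tiny `BASE_ENTRIES` dictionary, the `<CAP>` flag codeword, the naive suffix-based `split_particle` function standing in for a real Indonesian stemmer, and the choice to send every unknown token (including digits) to the supplement dictionary. It only shows how tokens might be mapped to 16-bit codewords through a basic dictionary plus a per-document supplement dictionary, and how the text can be rebuilt from them.

```python
import re
import struct

# Hypothetical toy data; the paper's real dictionaries are not given in the abstract.
PARTICLES = ("lah", "kah", "pun")
BASE_ENTRIES = ["<CAP>", " ", ".", ",", "lah", "kah", "pun", "buku", "baca", "saya"]
BASE_DICT = {entry: i for i, entry in enumerate(BASE_ENTRIES)}
CAP = BASE_DICT["<CAP>"]

# Words, digit runs, or any other single character (lossless tokenization).
TOKEN_RE = re.compile(r"[A-Za-z]+|\d+|[^A-Za-z\d]")


def split_particle(word):
    """Naive stand-in for a stemmer: strip one trailing particle if present."""
    for p in PARTICLES:
        if word.endswith(p) and len(word) > len(p) + 2:
            return word[:-len(p)], p
    return word, None


def encode(text):
    """Return (packed 16-bit codewords, supplement dictionary as a list)."""
    supplement = {}  # token -> codeword, built per document
    codes = []

    def code_for(token):
        if token in BASE_DICT:
            return BASE_DICT[token]
        if token not in supplement:
            supplement[token] = len(BASE_DICT) + len(supplement)
        return supplement[token]

    for token in TOKEN_RE.findall(text):
        if token.isascii() and token.isalpha():
            lower = token.lower()
            # Capitalized word: emit a flag codeword, then handle the lowercase form.
            if token != lower and token == lower.capitalize():
                codes.append(CAP)
                token = lower
            if token == lower:
                base, particle = split_particle(token)
                codes.append(code_for(base))
                if particle:
                    codes.append(code_for(particle))
                continue
        # Symbols, numbers and mixed-case words are stored as they are.
        codes.append(code_for(token))

    if len(BASE_DICT) + len(supplement) > 0x10000:
        raise ValueError("more entries than 16-bit codewords can address")
    return struct.pack(f">{len(codes)}H", *codes), list(supplement)


def decode(data, supplement):
    """Rebuild the text from codewords, the base dictionary and the supplement."""
    table = BASE_ENTRIES + supplement
    out, cap_next = [], False
    for (code,) in struct.iter_unpack(">H", data):
        if code == CAP:
            cap_next = True
            continue
        token = table[code]
        if cap_next:
            token = token.capitalize()
            cap_next = False
        out.append(token)
    return "".join(out)


data, supp = encode("Saya baca bukulah, 2 kali.")
print(supp)                # tokens that were not in the basic dictionary
print(decode(data, supp))  # the original sentence
```

With a dictionary this small the encoded form is not shorter than the input; any gain would depend on long, frequent words being covered by the dictionaries, which is outside what this sketch demonstrates.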


Bibliographic details

Source
International Conference on Information and Communication Technology | 2015 | pp. 450-454 | 5 pages
Authors
Sinaga Ardiles; Adiwijaya; Nugroho Hertog
Original format: PDF
Keywords
Data Compression; LZW; Stemming; Tree Structure; Word-Based

Similar literature

1. Multi-Stream Word-Based Compression Algorithm for Compressed Text Search [J]. Ozturk Emir, Mesut Altan, Diri Banu. Arabian Journal for Science and Engineering, 2018, No. 12.
2. INDONESIAN TEXT DOCUMENT SIMILARITY DETECTION SYSTEM USING RABIN-KARP AND CONFIX-STRIPPING ALGORITHMS [J]. Deardo Dibrianto Sinaga, Seng Hansun. International Journal of Innovative Computing Information and Control, 2018, No. 5.
3. Plagiarism Detection System for Indonesia Text Based Document by Fingerprint Method and Natural Language Processing Approach [J]. Advanced Science Letters, 2016, No. 10.
4. Development of word-based text compression algorithm for Indonesian language document [C]. Sinaga Ardiles, Adiwijaya, Nugroho Hertog. International Conference on Information and Communication Technology, 2015.
5. Memory-efficient algorithms for raster document image compression [D]. Figuera Alegre, Maribel. 2008.
6. Swarm Intelligence Algorithms in Text Document Clustering with Various Benchmarks [O]. Suganya Selvaraj, Eunmi Choi. 2021.
7. Constructing Word-Based Text Compression Algorithms [O]. Nigel Horspool, Gordon Cormack. 1992.
