Conference: Fifth International Conference on Digital Information Management

Text data compression ratio as a text attribute for a language-independent text art extraction method


Abstract

Text-based pictures called text art are often used in Web pages, email text, and so on. They enrich expression in text data, but they can be noise when the text data is processed. For example, they can be an obstacle for text-to-speech software and natural language processing. Text art extraction methods, which detect the area of text art in given text data, help to solve such problems. Previously proposed text art extraction methods, however, do not work well for text data containing more than one natural language, because they assume that a specific natural language is used in the text data. In a previous paper, we proposed a text art extraction method for multiple natural languages. That method uses an attribute based on successive occurrences of the same two characters. The attribute captures the characteristic that the same characters often appear successively in text art. In this paper, we replace that attribute in our extraction method with two data compression ratios of the text data: the compression ratio by Run Length Encoding (RLE) and that by LZ77. Our experiments show that our extraction method with the RLE compression ratio works better than both the method with the LZ77 compression ratio and our previous extraction method.
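To make the RLE-based attribute concrete, the following is a minimal Python sketch of how a run-length compression ratio might be computed for a piece of text. It is not the authors' implementation. The function names `rle_encode` and `rle_compression_ratio` are hypothetical, and the size model (each run costs two units, one for the character and one for the count) is an assumption, since the abstract does not define how the ratio is measured.

```python
from itertools import groupby


def rle_encode(text: str) -> list[tuple[str, int]]:
    """Run-length encode text as (character, run length) pairs."""
    return [(ch, sum(1 for _ in run)) for ch, run in groupby(text)]


def rle_compression_ratio(text: str) -> float:
    """Return encoded size divided by original size.

    Assumption for illustration: every (character, count) pair costs
    2 units and every original character costs 1 unit. The paper may
    define the ratio differently.
    """
    if not text:
        return 1.0  # nothing to compress; treat the ratio as neutral
    return 2 * len(rle_encode(text)) / len(text)


# A small box drawn with characters, and an ordinary English sentence.
text_art = "+--------+\n|        |\n+--------+"
prose = "Text art is often used in Web pages."

print(rle_compression_ratio(text_art))  # lower: long runs of '-' and ' '
print(rle_compression_ratio(prose))     # higher: almost no repeated characters
```

Under this size model, text with long runs of identical characters yields a lower ratio than ordinary prose, which is the property the abstract relies on. Because the measure depends only on character repetition, it does not assume any particular natural language.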
