Foreign-language conference: 2011 23rd IEEE International Conference on Tools with Artificial Intelligence

Similarity Calculation with Length Delimiting Dictionary Distance

Abstract

The Normalized Compression Distance (NCD) has gained considerable interest in pattern recognition as a similarity measure applicable to unstructured data of very different domains, such as text, DNA sequences, or images. NCD uses existing compression programs such as gzip to compute similarity between objects. NCD has unique features: it does not require any prior knowledge, data preprocessing, feature extraction, domain adaptation, or parameter settings. Further, the NCD can be applied to symbolic data and raw signals alike. In this paper we decompose the NCD and introduce a method to measure compression-based similarity without the need to use compression. The Length Delimiting Dictionary Distance (LD³) takes the one component essential in compression methods, the dictionary generation, and strips the NCD of all dispensable components. The LD³ performs "compression based pattern recognition without compression", keeping all of the above benefits of the NCD while achieving better speed and recognition rates. We first review the NCD, introduce LD³ as the "essence" of NCD, and evaluate the LD³ based on language tree experiments, authorship recognition, and genome phylogeny data.
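The abstract does not spell out the NCD itself. As a minimal sketch, assuming the commonly cited definition NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(.) is the compressed length in bytes, the baseline that LD³ is compared against could be computed with gzip as follows. The function names `compressed_size` and `ncd` are illustrative and do not come from the paper, and the sketch does not attempt to reproduce LD³, whose details are not given here.

```python
import gzip


def compressed_size(data: bytes) -> int:
    """Length in bytes of the gzip-compressed input (stands in for C(.))."""
    return len(gzip.compress(data, compresslevel=9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance under the assumed standard formula.

    Values near 0 suggest the inputs share much structure; values near 1
    suggest they share little. With a real compressor the result is only
    an approximation and can fall slightly outside [0, 1].
    """
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)  # compress the concatenation of both inputs
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    english_a = b"the quick brown fox jumps over the lazy dog " * 20
    english_b = b"the quick brown fox leaps over the lazy cat " * 20
    binary = bytes(range(256)) * 4
    print("similar texts:   ", ncd(english_a, english_b))
    print("dissimilar data: ", ncd(english_a, binary))
```

No preprocessing or feature extraction is involved, which matches the properties the abstract attributes to the NCD; the only choice made here is the compressor and its level.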
