Similarity Calculation with Length Delimiting Dictionary Distance

Abstract

The Normalized Compression Distance (NCD) has gained considerable interest in pattern recognition as a similarity measure applicable to unstructured data of very different domains, such as text, DNA sequences, or images. NCD uses existing compression programs such as gzip to compute similarity between objects. NCD has unique features: It does not require any prior knowledge, data preprocessing, feature extraction, domain adaptation or any parameter settings. Further, the NCD can be applied to symbolic data and raw signals alike. In this paper we decompose the NCD and introduce a method to measure compression-based similarity without the need to use compression. The Length Delimiting Dictionary Distance (LD³) takes the one component essential in compression methods, the dictionary generation, and strips the NCD of all dispensable components. The LD³ performs "compression based pattern recognition without compression", keeping all of the above benefits of the NCD while achieving better speed and recognition rates. We first review the NCD, introduce LD³ as the "essence" of NCD, and evaluate the LD³ based on language tree experiments, authorship recognition, and genome phylogeny data.
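As a rough illustration of the compression-based similarity the abstract reviews, the sketch below computes the NCD with an off-the-shelf compressor. It assumes the commonly cited definition NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length; the abstract itself does not state this formula. Python's zlib (DEFLATE, the algorithm behind gzip) stands in for the gzip program, and the function names are illustrative only. This is not the LD³ method, whose definition is not given in this record.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (stand-in for gzip)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance under the assumed standard definition.

    Values near 0 suggest similar inputs; values near 1 suggest dissimilar ones.
    """
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)  # compress the concatenation of both objects
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    text = b"the quick brown fox jumps over the lazy dog " * 20
    other = bytes(range(256)) * 4  # byte pattern unrelated to the text
    print("ncd(text, text)  =", ncd(text, text))
    print("ncd(text, other) =", ncd(text, other))
```

With a real compressor the distance of an object to itself is small but usually not exactly zero, because the compressed concatenation still carries some overhead.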

Bibliographic details

Source
2011 23rd IEEE International Conference on Tools with Artificial Intelligence | 2011 | pp. 856-864 | 9 pages
Authors
Burkovski Andre; Klenk Sebastian; Heidemann Gunther
Original format
PDF
Chinese Library Classification
Artificial intelligence theory
Keywords
dictionary-based compression; normalized compression distance; parameter-free data mining; pattern recognition; similarity metric

Similar literature

1. Linear regression model of short k-word: A similarity distance suitable for biological sequences with various lengths [J]. Yang X., Wang T. Journal of Theoretical Biology. 2013

2. The determination of pair-distance distribution by double electron electron resonance: regularization by the length of distance discretization with Monte Carlo calculations [J]. Dzuba Sergei A. Journal of Magnetic Resonance. 2016

3. Image Similarity Estimation Based on Ratio and Distance Calculation between Features [J]. R. P. Bohush, S. V. Ablameyko, E. R. Adamovskiy. Pattern recognition and image analysis: advances in mathematical theory and applications in the USSR. 2020, No. 2

4. Similarity Calculation with Length Delimiting Dictionary Distance [C]. Andre Burkovski, Sebastian Klenk, Gunther Heidemann. International Conference on Tools with Artificial Intelligence. 2011

5. Fast Edit Distance Calculation Methods for NGS Sequence Similarity [D]. Islam, A. K. M. Tauhidul. 2020

6. IDSSIM: an lncRNA functional similarity calculation model based on an improved disease semantic similarity method [O]. Wenwen Fan, Junliang Shang, Feng Li. 2020

7. Table S3: Species delimiting as implemented in Geneious, using ML and BI for both the total dataset and a subgroup of MOTUs. Closest species, Intraspecific distance, Interspecies distance, ratio of Intra/Interspecific, P ID(strict), Rosenberg's Pab, and Rodrigo's P(RD) are indicated. Colours code for significance. c, d, g, gr, hu, s, and sp. correspond with the respective taxon names; NA, not applicable [O].

8. NONEUC: A Numerical Procedure for Determining Optimum Euclidian Distances and Associated Coordinates from Distances Derived from Similarity Coefficients [R]. Fields, D. E., Kelsey, C. T., Goff, F. G. 1977
