Proposal and study of statistical features for string similarity computation and classification

E.O. Rodrigues; D. Casanova; M. Teixeira; V. Pegorini; F. Favarim; E. Clua; A. Conci; Panos Liatsis

Abstract

Adaptations of features commonly applied in the field of visual computing, co-occurrence matrix (COM) and run-length matrix (RLM), are proposed for the similarity computation of strings in general (words, phrases, codes and texts). The proposed features are not sensitive to language-related information. They are purely statistical and can be used in any context with any language or grammatical structure. Other statistical measures that are commonly employed in the field, such as longest common subsequence, maximal consecutive longest common subsequence, mutual information and edit distances, are evaluated and compared. In the first synthetic set of experiments, the COM and RLM features outperform the remaining state-of-the-art statistical features. In 3 out of 4 cases, the RLM and COM features performed significantly better than the second-best group, which was based on distances (P-value < 0.001). On a real text plagiarism dataset, the RLM features obtained the best results.
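The measures named in the abstract can be illustrated with a minimal Python sketch. The longest common subsequence and the Levenshtein edit distance are the standard textbook definitions. The abstract does not specify how the COM and RLM features are adapted to strings, so `cooccurrence_counts` and `run_lengths` below are hypothetical one-dimensional analogues of the image-processing matrices, not the authors' actual features; all function names and the `offset` parameter are assumptions made for illustration.

```python
from collections import Counter


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cooccurrence_counts(s: str, offset: int = 1) -> Counter:
    """Count character pairs (s[i], s[i + offset]).

    Assumption: a 1-D analogue of an image co-occurrence matrix;
    the paper's adaptation may differ.
    """
    return Counter(zip(s, s[offset:]))


def run_lengths(s: str) -> Counter:
    """Count (character, run length) pairs for maximal runs of equal characters.

    Assumption: a 1-D analogue of an image run-length matrix;
    the paper's adaptation may differ.
    """
    runs = Counter()
    i = 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:
            j += 1
        runs[(s[i], j - i)] += 1
        i = j
    return runs
```

In this sketch, two strings could be compared by computing such counts for each and then measuring how much the resulting tables differ; how the paper turns the matrices into similarity features is not stated in the abstract.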

Bibliographic record

Source
International journal of data mining, modelling and management, 2020, No. 3, pp. 277-307 (31 pages)
Authors
E.O. Rodrigues; D. Casanova; M. Teixeira; V. Pegorini; F. Favarim; E. Clua; A. Conci; Panos Liatsis
Author affiliations

Academic Department of Informatics Universidade Tecnologica Federal do Paraná (UTFPR);

Academic Department of Informatics Universidade Tecnologica Federal do Paraná (UTFPR);

Academic Department of Informatics Universidade Tecnologica Federal do Paraná (UTFPR);

Academic Department of Informatics Universidade Tecnologica Federal do Paraná (UTFPR);

Academic Department of Informatics Universidade Tecnologica Federal do Paraná (UTFPR);

Department of Computer Science Universidade Federal Fluminense (UFF);

Department of Computer Science Universidade Federal Fluminense (UFF);

Department of Electrical Engineering and Computer Science Khalifa University;

Original format: PDF
Full-text language: English
Keywords
word comparison; string similarity; classification; statistical features; text mining; optical character recognition; OCR; text plagiarism; text entailment; supervised learning;

Date added to database: 2022-08-18 21:31:41

Similar literature

1. A Study of Image Retrieval System Based on Feature Extraction, Selection, Classification and Similarity Measurements [J]. Yogapriya J., Saravanabhavan C., Asokan R. Journal of Medical Imaging and Health Informatics, 2018, No. 3
2. Classification Similarity Learning Using Feature-Based and Distance-Based Representations: A Comparative Study [J]. Lopez-Inesta Emilia, Grimaldo Francisco, Arevalillo-Herraez Miguel. Applied Artificial Intelligence, 2015, No. 4a6
3. Comparative Study of Motion Features for Similarity-Based Modeling and Classification of Unsafe Actions in Construction [J]. SangUk Han, SangHyun Lee, Feniosky Pena-Mora. Journal of Computing in Civil Engineering, 2014, No. 5
4. A study of the effect of feature reduction via statistically significant pixel selection on fruit object representation, classification, and machine learning prediction [C]. Beaulieu P., Megherbi D.B. IEEE International Conference on Computational Intelligence and Virtual Environments for Measurement Systems and Applications, 2014
5. An Electronic Nose System Based on Evolutionary Computation and Similarity Measures for Classification and Quantification of Gases [D]. Rehman, Atiq Ur. 2019
6. Multicentric study: statistical correlation between clinical data and instrumental findings in laryngo-pharyngeal reflux: proposal for a new ENT classification of reflux [O]. CA Leone, F Mosca. 2006
7. Graph-based feature extraction: A new proposal to study the classification of music signals outside the time-frequency domain [O]. Dirceu de Freitas Piedade Melo, Inacio de Sousa Fadigas, Hernane Borges de Barros Pereira. 2020
8. Keypoint Density-Based Region Proposal for Fine-Grained Object Detection and Classification Using Regions with Convolutional Neural Network Features [R]. Turner, J. T., Gupta, K., Morris, B. 2015
