COMPARISON OF STRING SIMILARITY ALGORITHMS TO MEASURE LEXICAL SIMILARITY

Sagar J. Gandhi; Mihirraj M. Thakor; Jikitsha Sheth; Hariom I. Pandit; Hemin S. Patel


Abstract

String similarity measures the lexical similarity between two words, and it can be further exploited to identify similarity between questions. Several string similarity algorithms exist in the literature. In this paper the authors implement five of them: Dice coefficient, Jaccard similarity, Levenshtein distance, Jaro distance and Cosine similarity. The results of these algorithms are then compared with the ratings of human judges to determine which algorithm most closely resembles the way humans distinguish between the given strings. The experiments are run over 1000 English word pairs.
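
As a rough illustration of the five measures named above, the following Python sketch computes each of them for a pair of words. It is not the authors' implementation: the abstract does not state how the words are tokenised, so the use of character bigrams for the Dice, Jaccard and Cosine measures is an assumption, and all function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def bigrams(s):
    """Character bigrams of s (assumed unit; the abstract does not specify one)."""
    return [s[i:i + 2] for i in range(len(s) - 1)]


def dice(a, b):
    """Dice coefficient over the sets of bigrams: 2|X & Y| / (|X| + |Y|)."""
    x, y = set(bigrams(a)), set(bigrams(b))
    if not x and not y:
        return 1.0
    return 2 * len(x & y) / (len(x) + len(y))


def jaccard(a, b):
    """Jaccard similarity over the sets of bigrams: |X & Y| / |X | Y|."""
    x, y = set(bigrams(a)), set(bigrams(b))
    if not x and not y:
        return 1.0
    return len(x & y) / len(x | y)


def cosine(a, b):
    """Cosine similarity between bigram count vectors."""
    x, y = Counter(bigrams(a)), Counter(bigrams(b))
    dot = sum(x[g] * y[g] for g in x)
    norm = sqrt(sum(v * v for v in x.values())) * sqrt(sum(v * v for v in y.values()))
    return dot / norm if norm else 0.0


def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def jaro(a, b):
    """Jaro similarity: matches within a sliding window, penalised by transpositions."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_hit = [False] * len(a)
    b_hit = [False] * len(b)
    m = 0
    for i, ca in enumerate(a):
        lo, hi = max(0, i - window), min(len(b), i + window + 1)
        for j in range(lo, hi):
            if not b_hit[j] and b[j] == ca:
                a_hit[i] = b_hit[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    a_seq = [c for c, h in zip(a, a_hit) if h]
    b_seq = [c for c, h in zip(b, b_hit) if h]
    t = sum(x != y for x, y in zip(a_seq, b_seq)) / 2  # transpositions
    return (m / len(a) + m / len(b) + (m - t) / m) / 3


if __name__ == "__main__":
    for name, fn in [("dice", dice), ("jaccard", jaccard), ("cosine", cosine),
                     ("levenshtein", levenshtein), ("jaro", jaro)]:
        print(name, fn("night", "nacht"))
```

Note that Levenshtein returns a distance (lower means more similar), whereas the other four return a similarity between 0 and 1. Comparing them on a common scale therefore needs some normalisation, for example dividing the distance by the length of the longer word; the abstract does not say which normalisation, if any, the authors used.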


Bibliographic details

Source
National Journal of System and Information Technology, 2017, Issue 2, pp. 139-154 (16 pages)
Authors
Sagar J. Gandhi; Mihirraj M. Thakor; Jikitsha Sheth; Hariom I. Pandit; Hemin S. Patel
Author affiliations

Institute of Management and Computer Applications of UTU, Bardoli;

Institute of Management and Computer Applications of UTU, Bardoli;

Shrimad Rajchandra Inst;

Institute of Management and Computer Applications of UTU, Bardoli;

Institute of Management and Computer Applications of UTU, Bardoli;

Original format
PDF
Language of text
English
