Journal: International Journal of Data Mining, Modelling and Management

Proposal and study of statistical features for string similarity computation and classification


Abstract

Adaptations of two features commonly used in visual computing, the co-occurrence matrix (COM) and the run-length matrix (RLM), are proposed for computing the similarity of strings in general (words, phrases, codes and texts). The proposed features are insensitive to language-related information: they are purely statistical and can be used in any context, with any language or grammatical structure. Other statistical measures commonly employed in the field, such as longest common subsequence, maximal consecutive longest common subsequence, mutual information and edit distances, are evaluated and compared. In the first, synthetic set of experiments, the COM and RLM features outperform the remaining state-of-the-art statistical features. In 3 out of 4 cases, the RLM and COM features performed statistically significantly better than the second-best group, which is based on distances (P-value < 0.001). On a real text plagiarism dataset, the RLM features obtained the best results.
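The abstract does not say how the COM and RLM features are adapted to strings, so the following is only a minimal sketch of one plausible reading, not the authors' method. It assumes character-level symbols, a co-occurrence count over ordered pairs at a fixed offset, a run-length count over maximal runs of a repeated symbol, and cosine similarity as the comparison. The function names, the `offset` parameter and the choice of cosine similarity are all illustrative assumptions.

```python
from collections import Counter
from math import sqrt


def cooccurrence_counts(s: str, offset: int = 1) -> Counter:
    """Count ordered symbol pairs (s[i], s[i + offset]).

    Assumption: a sparse, character-level stand-in for a co-occurrence
    matrix; `offset` (>= 1) plays the role of the pixel displacement
    used for images.
    """
    return Counter(zip(s, s[offset:]))


def run_length_counts(s: str) -> Counter:
    """Count maximal runs of a repeated symbol as (symbol, run_length) pairs.

    Assumption: a sparse stand-in for a run-length matrix.
    """
    runs = Counter()
    i = 0
    while i < len(s):
        j = i
        # Advance j to the end of the current run of s[i].
        while j < len(s) and s[j] == s[i]:
            j += 1
        runs[(s[i], j - i)] += 1
        i = j
    return runs


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    x, y = "similarity", "similarly"
    print(cosine_similarity(cooccurrence_counts(x), cooccurrence_counts(y)))
    print(cosine_similarity(run_length_counts(x), run_length_counts(y)))
```

Because both representations depend only on symbol counts, the sketch needs no tokenizer, dictionary or grammar, which is consistent with the language-independence the abstract claims for the proposed features.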