Journal: Expert Systems with Applications

Framework for syntactic string similarity measures


Abstract

A similarity measure is an essential component of information retrieval, document clustering, text summarization, and question answering, among other tasks. In this paper, we introduce a general framework of syntactic similarity measures for matching short texts. We thoroughly analyze the measures by dividing them into three components: character-level similarity, string segmentation, and matching technique. Soft variants of the measures are also introduced. With the help of two existing toolkits (SecondString and SimMetric), we provide an open-source Java toolkit for the proposed framework, which integrates the individual components so that completely new combinations can be created. Experimental results reveal that the performance of the similarity measures depends on the type of dataset. For well-maintained datasets, using a token-level measure is important, but the basic (crisp) variant is usually sufficient. For uncontrolled datasets, where typing errors are expected, the soft variants of the token-level measures are necessary. Among all tested measures, a soft token-level measure that combines set matching with q-grams at the character level performs best. A gap between human perception and syntactic measures remains because the measures lack semantic analysis. (C) 2019 Elsevier Ltd. All rights reserved.
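
The three components named in the abstract (character-level similarity, string segmentation, and matching technique) can be illustrated with a minimal Python sketch. This is not the paper's Java toolkit and does not reproduce its exact definitions. The function names, the whitespace tokenizer, the Dice coefficient over unpadded bigrams, the greedy one-to-one matching, and the 0.5 threshold are all assumptions chosen for illustration.

```python
def qgrams(token, q=2):
    """Return the set of character q-grams of a token (no padding; assumption)."""
    if len(token) < q:
        return {token} if token else set()
    return {token[i:i + q] for i in range(len(token) - q + 1)}


def qgram_similarity(a, b, q=2):
    """Character-level similarity: Dice coefficient over q-gram sets."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    if not ga and not gb:
        return 1.0
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))


def tokenize(text):
    """String segmentation: lowercase and split on whitespace (assumption)."""
    return text.lower().split()


def crisp_jaccard(s1, s2):
    """Crisp set matching: tokens count only when they are identical."""
    t1, t2 = set(tokenize(s1)), set(tokenize(s2))
    if not t1 and not t2:
        return 1.0
    return len(t1 & t2) / len(t1 | t2)


def soft_jaccard(s1, s2, threshold=0.5, q=2):
    """Soft set matching: two tokens match when their character-level
    similarity reaches the (hypothetical) threshold."""
    t1, t2 = set(tokenize(s1)), set(tokenize(s2))
    if not t1 and not t2:
        return 1.0
    # Score every token pair, then match greedily from the most similar pair down.
    pairs = sorted(
        ((qgram_similarity(a, b, q), a, b) for a in t1 for b in t2),
        reverse=True,
    )
    used1, used2 = set(), set()
    matched = 0
    for sim, a, b in pairs:
        if sim < threshold:
            break  # remaining pairs are even less similar
        if a not in used1 and b not in used2:
            used1.add(a)
            used2.add(b)
            matched += 1
    # Jaccard with soft matches standing in for the intersection size.
    return matched / (len(t1) + len(t2) - matched)


if __name__ == "__main__":
    clean = "information retrieval"
    typo = "infromation retreival"
    print(crisp_jaccard(clean, typo))
    print(soft_jaccard(clean, typo))
```

On a pair with typing errors such as the one in the usage lines, the crisp variant finds no identical tokens, while the soft variant can still pair the misspelled tokens with their intended forms. This mirrors, in a small way, the abstract's point about uncontrolled datasets.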
