Expert Systems with Applications

Framework for syntactic string similarity measures

Abstract

Similarity measures are an essential component of information retrieval, document clustering, text summarization, and question answering, among others. In this paper, we introduce a general framework of syntactic similarity measures for matching short text. We thoroughly analyze the measures by dividing them into three components: character-level similarity, string segmentation, and matching technique. Soft variants of the measures are also introduced. With the help of two existing toolkits (SecondString and SimMetric), we provide an open-source Java toolkit of the proposed framework, which integrates the individual components so that completely new combinations can be created. Experimental results reveal that the performance of the similarity measures depends on the type of dataset. For well-maintained datasets, using a token-level measure is important, but the basic (crisp) variant is usually enough. For uncontrolled datasets where typing errors are expected, the soft variants of the token-level measures are necessary. Among all tested measures, a soft token-level measure that combines set matching and q-grams at the character level performs best. A gap between human perception and syntactic measures remains because the measures lack semantic analysis. (C) 2019 Elsevier Ltd. All rights reserved.
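The three components named in the abstract can be illustrated with a minimal sketch. It is written in Python for brevity and is not the API of the authors' Java toolkit, SecondString, or SimMetric. All function names, the bigram size, the Dice coefficient at the character level, the whitespace segmentation, the greedy one-to-one token matching, and the 0.6 threshold are assumptions chosen for illustration; the paper's exact definitions may differ.

```python
def qgrams(s, q=2):
    """Character level: set of q-grams of s (the whole string if shorter than q)."""
    s = s.lower()
    if len(s) < q:
        return {s} if s else set()
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def qgram_sim(a, b, q=2):
    """Character-level similarity: Dice coefficient over q-gram sets (assumed choice)."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    if not ga and not gb:
        return 1.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))


def tokens(s):
    """String segmentation: lowercase whitespace tokens (assumed choice)."""
    return s.lower().split()


def crisp_jaccard(s1, s2):
    """Crisp set matching: tokens count as shared only if they are identical."""
    t1, t2 = set(tokens(s1)), set(tokens(s2))
    if not t1 and not t2:
        return 1.0
    return len(t1 & t2) / len(t1 | t2)


def soft_jaccard(s1, s2, threshold=0.6, q=2):
    """Soft set matching: tokens count as shared if their q-gram similarity
    reaches the threshold. Greedy one-to-one pairing, illustrative only."""
    t1, t2 = sorted(set(tokens(s1))), sorted(set(tokens(s2)))
    if not t1 and not t2:
        return 1.0
    unmatched = list(t2)
    matched = 0
    for a in t1:
        # Pick the most similar token that has not been paired yet.
        best = max(unmatched, key=lambda b: qgram_sim(a, b, q), default=None)
        if best is not None and qgram_sim(a, best, q) >= threshold:
            matched += 1
            unmatched.remove(best)
    # Jaccard with soft intersection: |A ∩ B| / (|A| + |B| - |A ∩ B|)
    return matched / (len(t1) + len(t2) - matched)
```

For the pair "coffee shop" and "cofee shop", the crisp measure gives 1/3 because only "shop" matches exactly, while the soft measure gives 1.0 because "coffee" and "cofee" share enough bigrams to count as the same token. This mirrors the abstract's point that soft variants matter when typing errors are expected.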