The N-Grams Based Text Similarity Detection Approach Using Self-Organizing Maps and Similarity Measures

Abstract

This paper proposes an approach based on word-level n-grams for detecting similarity between texts. The approach combines two independent techniques: the self-organizing map (SOM) and text similarity measures. The distinctive feature of the SOM is that it presents the results of data clustering and dimensionality reduction in visual form. Four similarity measures are evaluated: cosine, Dice, extended Jaccard, and overlap. The texts must first be converted to a numerical representation. To do this, each text is split into word-level n-grams, and a bag of n-grams is built from them. The n-gram frequencies are then calculated, forming the frequency matrix of the dataset. Various filters are applied when building the bag of n-grams, including stemming algorithms, number and punctuation removal, and stop-word removal. All experiments were carried out on a corpus of plagiarized short answers.
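
The numerical conversion and the four measures named in the abstract can be illustrated with a minimal Python sketch. The abstract gives no formulas, so the definitions below are the common vector-space forms of cosine, Dice, extended Jaccard, and overlap, used here as an assumption rather than as the authors' exact definitions. The function names, the whitespace tokenization, and the choice of n = 2 are also illustrative; the filters (stemming, stop words, number and punctuation removal) and the SOM step are omitted.

```python
from collections import Counter
import math


def word_ngrams(text, n=2):
    """Split text into lowercase word-level n-grams (plain whitespace tokenization)."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def frequency_matrix(texts, n=2):
    """Return (vocab, rows), where rows[i][j] counts n-gram vocab[j] in texts[i]."""
    counts = [Counter(word_ngrams(t, n)) for t in texts]
    vocab = sorted(set().union(*counts))  # the shared bag of n-grams
    return vocab, [[c[g] for g in vocab] for c in counts]


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


# Assumed vector-space forms of the four measures; each returns 0.0
# when its denominator is zero (for example, for an empty text).
def cosine(x, y):
    d = math.sqrt(dot(x, x) * dot(y, y))
    return dot(x, y) / d if d else 0.0


def dice(x, y):
    d = dot(x, x) + dot(y, y)
    return 2 * dot(x, y) / d if d else 0.0


def extended_jaccard(x, y):
    d = dot(x, x) + dot(y, y) - dot(x, y)
    return dot(x, y) / d if d else 0.0


def overlap(x, y):
    d = min(dot(x, x), dot(y, y))
    return dot(x, y) / d if d else 0.0


if __name__ == "__main__":
    texts = ["the cat sat on the mat", "the cat sat on the sofa"]
    vocab, (a, b) = frequency_matrix(texts, n=2)
    for measure in (cosine, dice, extended_jaccard, overlap):
        print(measure.__name__, round(measure(a, b), 3))
```

In this toy example the two sentences share four of their five bigrams, so all four measures come out high. In the approach described above, the rows of the frequency matrix would also be the input vectors for the SOM.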