Conference: SAI Computing Conference

Visualizing document similarity using n-grams and latent semantic analysis

Abstract

As the number of information resources and documents explodes, efficient tools with intuitive visualization capabilities are urgently needed to help users conduct document similarity analysis and/or plagiarism detection by discovering hidden relations among documents. This paper proposes a content-based method for document similarity analysis and visualization. The method models the relationship between documents and their n-gram phrases, which are generated from the normalized text by exploiting morphological analysis and lexical lookup. Possible morphological ambiguities are resolved by tagging the words in the examined documents. Text indexing and stop-word removal are performed with a new technique that handles multiple long documents efficiently. The TF-IDF model of the examined documents is constructed using a heuristic-based pairwise matching algorithm that accounts for lexical and syntactic changes. The hidden associations between the documents and their unique n-gram phrases are then investigated using Latent Semantic Analysis (LSA). Next, the pairwise document subset and similarity measures are derived from the Singular Value Decomposition (SVD) computations. Different visualization techniques are then applied to the SVD results to expose the hidden relations among the documents under consideration. Because Arabic is one of the most morphologically rich and complex languages, this paper emphasizes similarity analysis and visualization of Arabic documents. Experiments demonstrate the strong capability of the proposed method in analyzing and visualizing literal similarity and some types of intelligent similarity.
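The core of the pipeline described above (n-gram phrases, TF-IDF weighting, LSA through SVD, pairwise similarity) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes plain whitespace tokenization in place of the morphological analysis, tagging, and stop-word removal, uses a standard smoothed IDF rather than the heuristic pairwise matching algorithm, and obtains the truncated SVD factors from a power iteration on the document Gram matrix instead of a library SVD routine. All function names (`word_ngrams`, `tfidf_rows`, `top_eigenpairs`, `lsa_doc_vectors`, `cosine`) and the toy English documents are hypothetical.

```python
import math
import random
from collections import Counter


def word_ngrams(text, n=2):
    """Return word-level n-grams from lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf_rows(docs, n=2):
    """Build one TF-IDF row per document over the sorted n-gram vocabulary."""
    counts = [Counter(word_ngrams(d, n)) for d in docs]
    vocab = sorted(set().union(*counts))
    n_docs = len(docs)
    # Smoothed IDF (an assumption; the paper's exact weighting is not given).
    idf = {g: math.log((1 + n_docs) / (1 + sum(1 for c in counts if g in c))) + 1
           for g in vocab}
    return [[c[g] * idf[g] for g in vocab] for c in counts]


def top_eigenpairs(gram, k, iters=200, seed=0):
    """Top-k eigenpairs of a symmetric PSD matrix via power iteration
    with deflation. Adequate for a toy example only."""
    n = len(gram)
    g = [row[:] for row in gram]
    rng = random.Random(seed)
    pairs = []
    for _ in range(k):
        v = [rng.random() for _ in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iters):
            w = [sum(g[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        # Rayleigh quotient gives the eigenvalue estimate.
        lam = sum(v[i] * g[i][j] * v[j] for i in range(n) for j in range(n))
        pairs.append((max(lam, 0.0), v))
        # Deflate: remove the component just found.
        for i in range(n):
            for j in range(n):
                g[i][j] -= lam * v[i] * v[j]
    return pairs


def lsa_doc_vectors(docs, n=2, k=2):
    """Map each document to k latent coordinates.

    If A (terms x docs) = U S V^T, then A^T A = V S^2 V^T, so the
    eigenpairs of the document Gram matrix give V and S^2, and document
    i has latent coordinates S_k * V_k[i].
    """
    rows = tfidf_rows(docs, n)
    gram = [[sum(a * b for a, b in zip(r, s)) for s in rows] for r in rows]
    pairs = top_eigenpairs(gram, k)
    return [[math.sqrt(lam) * v[i] for lam, v in pairs] for i in range(len(docs))]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


if __name__ == "__main__":
    docs = [
        "the quick brown fox jumps over the lazy dog",
        "the quick brown fox leaps over the lazy dog",
        "stock markets fell sharply after the announcement",
    ]
    vecs = lsa_doc_vectors(docs, n=2, k=2)
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            print(i, j, round(cosine(vecs[i], vecs[j]), 3))
```

In this toy corpus the first two sentences share six of their eight bigrams, so in the rank-2 latent space they point in essentially the same direction (cosine close to 1), while the third sentence shares no bigrams with them and ends up orthogonal (cosine close to 0). The two-dimensional vectors returned by `lsa_doc_vectors` could also be fed to a scatter plot, which is one simple way to visualize SVD results of this kind.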