Visualizing document similarity using n-grams and latent semantic analysis



Abstract

As the number of information resources and the volume of documents explode, efficient tools with intuitive visualization capabilities are desperately needed to assist users in document similarity analysis and/or plagiarism detection by discovering hidden relations among documents. This paper proposes a content-based method for document similarity analysis and visualization. The proposed method models the relationship between documents and their n-gram phrases, which are generated from the normalized text using morphological analysis and lexical lookup. Possible morphological ambiguities are resolved by tagging the words within the examined documents. Text indexing and stop-word removal are performed using a new technique that deals efficiently with multiple long documents. The TF-IDF model of the examined documents is constructed using a heuristic-based pairwise matching algorithm that accounts for lexical and syntactic changes. Then, the hidden associations between the documents and their unique n-gram phrases are investigated using Latent Semantic Analysis (LSA). Next, the pairwise document subset and similarity measures are derived from the Singular Value Decomposition (SVD) computations. Different visualization techniques are then applied to the SVD results to expose the hidden relations among the documents under consideration. As Arabic is one of the most morphologically rich and complex languages, this paper emphasizes similarity analysis and visualization of Arabic documents. Various experiments revealed the strong capabilities of the proposed method in analyzing and visualizing literal and some types of intelligent similarities.
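The core of the pipeline described in the abstract (n-gram phrases, a TF-IDF model, LSA via SVD, pairwise similarity) can be illustrated with a minimal sketch. This is not the author's implementation: it skips the morphological analysis, tagging, stop-word removal, and heuristic pairwise matching that the paper describes, and it uses plain whitespace-style tokenization on English text. The function names (`word_ngrams`, `tfidf_matrix`, `lsa_similarity`), the smoothed IDF formula, the choice of bigrams, and the rank `k=2` are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

import numpy as np


def word_ngrams(text, n=2):
    """Lowercase the text, split it into word tokens, and return word n-grams."""
    tokens = re.findall(r"\w+", text.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf_matrix(docs, n=2):
    """Build an (n-gram x document) TF-IDF matrix (assumed smoothed IDF variant)."""
    counts = [Counter(word_ngrams(d, n)) for d in docs]
    vocab = sorted(set().union(*counts))
    doc_freq = Counter(g for c in counts for g in c)
    n_docs = len(docs)
    matrix = np.zeros((len(vocab), n_docs))
    for j, c in enumerate(counts):
        total = sum(c.values()) or 1
        for i, gram in enumerate(vocab):
            if gram in c:
                tf = c[gram] / total
                idf = math.log((1 + n_docs) / (1 + doc_freq[gram])) + 1
                matrix[i, j] = tf * idf
    return matrix, vocab


def lsa_similarity(matrix, k=2):
    """Project documents into a rank-k latent space; return cosine similarities."""
    _, s, vt = np.linalg.svd(matrix, full_matrices=False)
    k = min(k, len(s))
    doc_coords = (np.diag(s[:k]) @ vt[:k]).T  # one row per document
    norms = np.linalg.norm(doc_coords, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for empty documents
    unit = doc_coords / norms
    return unit @ unit.T


# Hypothetical toy corpus: two near-duplicates and one unrelated document.
docs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox leaps over the lazy dog",
    "latent semantic analysis uses singular value decomposition",
]
matrix, vocab = tfidf_matrix(docs, n=2)
sim = lsa_similarity(matrix, k=2)
print(np.round(sim, 2))
```

In this toy corpus the first two documents share most of their bigrams, so their similarity in the reduced space is high, while the third document shares none and scores near zero against both. The resulting similarity matrix (or the low-rank document coordinates) is the kind of SVD output that a heat map or scatter plot could then visualize.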


Bibliographic details

Source
SAI Computing Conference | 2016 | 1 v. | 11 pages
Authors
Ashraf S. Hussein
Original format: PDF
Chinese Library Classification: Security and confidentiality
Keywords
Plagiarism; Text analysis; Data visualization; Semantics; Visualization; Natural language processing
Indexed on 2022-08-20 23:10:52

