Development of a document classification method by using geodesic distance to calculate similarity of documents


Abstract

The Internet gives people quick and convenient access to human knowledge through channels such as Web pages, social networks, digital libraries, and portals. Because information is exchanged and updated so quickly, however, the volume of information stored as digital documents is growing rapidly. We therefore face challenges in representing, storing, sorting, and classifying documents.

In this paper, we present a new approach to text classification based on semi-supervised machine learning and the Support Vector Machine (SVM). The novelty of the study is that we measure the distance between vectors by geodesic distance instead of Euclidean distance. To do this, each text is first expressed as an n-dimensional vector, so that each document is one point in an n-dimensional vector space. Each point is then connected to its nearby points to form a graph, and the geodesic distance between two points is the length of the shortest path between their vertices on that graph. Classification uses a kernel function computed from these shortest-path distances. We conducted experiments on Reuters articles covering 5 topics: Business, Markets, World, Politics, and Technology. To evaluate the proposed method, we ran both the traditional SVM based on Euclidean distance and the proposed SVM based on geodesic distance on the same data set. The proposed method achieved a higher correct classification rate than the traditional Euclidean-distance SVM, by 3.2% on average.
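The pipeline described in the abstract (points, neighbourhood graph, shortest paths, kernel) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the abstract does not say which neighbourhood rule or kernel is used, so the k-nearest-neighbour graph, the Gaussian form of the kernel, and the names `knn_graph`, `geodesic_distances`, `geodesic_kernel`, `k`, and `sigma` are all assumptions. The toy 2-D points stand in for document vectors.

```python
import heapq
import math


def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_graph(points, k):
    """Symmetric k-nearest-neighbour graph; edge weights are Euclidean.

    Assumption: the abstract only says points are connected to
    "nearby points", so k-NN is one possible choice.
    """
    n = len(points)
    graph = {i: {} for i in range(n)}
    for i in range(n):
        nearest = sorted(
            (euclidean(points[i], points[j]), j) for j in range(n) if j != i
        )
        for d, j in nearest[:k]:
            graph[i][j] = d
            graph[j][i] = d
    return graph


def geodesic_distances(graph, source):
    """Shortest-path (geodesic) distance from source to every vertex (Dijkstra)."""
    dist = {v: math.inf for v in graph}
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist[v]:
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def geodesic_kernel(points, k=2, sigma=1.0):
    """Kernel matrix exp(-d_geo^2 / (2 sigma^2)).

    Assumption: a Gaussian kernel over geodesic distance; the paper's
    actual kernel function is not given in the abstract.
    Unreachable pairs have infinite distance and kernel value 0.
    """
    graph = knn_graph(points, k)
    n = len(points)
    rows = []
    for i in range(n):
        d = geodesic_distances(graph, i)
        rows.append([math.exp(-d[j] ** 2 / (2 * sigma ** 2)) for j in range(n)])
    return rows


# Toy data: ten points along a U-shaped curve. The two ends are close in
# straight-line terms but far apart when measured along the curve.
U_SHAPE = [(0, 0), (1, 0), (2, 0), (3, 0), (3, 1),
           (3, 2), (3, 3), (2, 3), (1, 3), (0, 3)]

if __name__ == "__main__":
    g = knn_graph(U_SHAPE, k=2)
    print("Euclidean:", euclidean(U_SHAPE[0], U_SHAPE[9]))
    print("Geodesic: ", geodesic_distances(g, 0)[9])
```

On this toy curve the two endpoints are 3 units apart in a straight line but 9 units apart along the graph, which is the kind of difference the geodesic measure is meant to capture. A matrix like the one returned by `geodesic_kernel` could in principle be handed to an SVM implementation that accepts precomputed kernels; note that a Gaussian function of graph distances is not guaranteed to be positive semi-definite, so this sketch makes no claim about how the paper handles that.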

Bibliographic details

  • Author

    Hung Vo-Trung;

  • Author affiliation
  • Year 2020
  • Total pages
  • Original format PDF
  • Text language rus;ukr;eng
  • Chinese Library Classification
