...
Journal: IEEE Transactions on Emerging Topics in Computational Intelligence

A Mechanics-Based Similarity Measure for Text Classification in Machine Learning Paradigm



Abstract

Document classification and clustering are emerging as new challenges in the Big Data era, where terabytes of data are generated every second by billions of mobile phones, desktops, servers, and other mobile devices such as cameras and watches. The effectiveness of classification and clustering algorithms depends on the similarity measure used between two text documents in the corpus. We apply the Maxwell–Boltzmann distribution to find the similarity between two documents within a document corpus. In this paper, the document corpus is treated as a large system, individual documents as containers, attributes as subcontainers, and each term as a particle. The proposed similarity measure is named the Maxwell–Boltzmann Similarity Measure (MBSM). MBSM is derived from the overall distribution of feature values and the total number of nonzero features among the documents. We demonstrate that MBSM satisfies all properties of a document similarity measure. MBSM is incorporated into single-label $K$-nearest neighbors classification (SLKNN), multi-label $K$-nearest neighbors classification (MLKNN), and $K$-means clustering. We benchmark MBSM against other similarity measures such as Euclidean, Cosine, Jaccard, Pairwise, ITSim, and SMTP. The comparison shows that MBSM outperformed all the similarity measures compared, improving the classification accuracy of SLKNN and MLKNN and the clustering accuracy and entropy of the $K$-means algorithm, while making them more robust. Under tenfold cross-validation, the highest accuracy obtained is 0.9531 for SLKNN and 0.9373 for MLKNN. For $K$-means clustering, MBSM achieved the highest accuracy (0.6592) and the lowest entropy (0.2426) among all similarity measures, both on a scale of 0 to 1.
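
To illustrate how a similarity measure plugs into single-label $K$-nearest neighbors classification, here is a minimal Python sketch. It is not the authors' implementation: the abstract does not give the MBSM formula, so the sketch uses two of the baseline measures it names (cosine and Jaccard) as stand-ins. The function names, the bag-of-words representation, and the toy corpus are all assumptions made for illustration.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def jaccard_similarity(a, b):
    """Jaccard similarity on the sets of terms with nonzero weight."""
    terms_a = {t for t, w in a.items() if w != 0}
    terms_b = {t for t, w in b.items() if w != 0}
    union = terms_a | terms_b
    if not union:
        return 0.0
    return len(terms_a & terms_b) / len(union)


def knn_classify(query, training, k, similarity):
    """Single-label k-NN: majority vote among the k most similar documents.

    `similarity` is any function of two documents where larger means
    more alike; a measure such as MBSM would be passed in here.
    """
    ranked = sorted(
        training,
        key=lambda item: similarity(query, item[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def bag_of_words(text):
    """Hypothetical preprocessing: lowercase, split on whitespace, count."""
    return Counter(text.lower().split())


# Toy corpus, invented for this example only.
training = [
    (bag_of_words("the match ended with a late goal"), "sports"),
    (bag_of_words("the team won the final match"), "sports"),
    (bag_of_words("the market fell as stock prices dropped"), "finance"),
    (bag_of_words("investors sold stock after the market report"), "finance"),
]

query = bag_of_words("stock market prices")
predicted = knn_classify(query, training, k=3, similarity=cosine_similarity)
print(predicted)
```

Because the classifier only receives `similarity` as an argument, swapping one measure for another changes nothing else in the pipeline, which is the kind of setup that makes a like-for-like benchmark of similarity measures possible.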
