A Mechanics-Based Similarity Measure for Text Classification in Machine Learning Paradigm

Venkatanareshbabu Kuppili; Mainak Biswas; Damodar Reddy Edla; K. J. Ravi Prasad; Jasjit S. Suri

Abstract

Document classification and clustering are emerging as a new challenge in the Big Data era, in which terabytes of data are generated every second by billions of mobile phones, desktops, servers, and mobile devices such as cameras and watches. The effectiveness of classification and clustering algorithms depends on the similarity measure used between two text documents in the corpus. We apply the Maxwell–Boltzmann distribution to find the similarity between two documents within a document corpus. In this paper, the document corpus is treated as a large system, individual documents as containers, attributes as subcontainers, and each term as a particle. The proposed similarity measure is named the Maxwell–Boltzmann Similarity Measure (MBSM). MBSM is derived from the overall distribution of feature values and the total number of nonzero features among the documents. We demonstrate that MBSM satisfies all properties of a document similarity measure. MBSM is incorporated in single-label $K$-nearest neighbors classification (SLKNN), multi-label $K$-nearest neighbors classification (MLKNN), and $K$-means clustering. We benchmark MBSM against other similarity measures such as Euclidean, Cosine, Jaccard, Pairwise, ITSim, and SMTP. The comparison shows that MBSM outperformed all of these existing similarity measures, improving the classification accuracy of SLKNN and MLKNN and the clustering accuracy and entropy of the $K$-means algorithm while making them more robust. The highest accuracy obtained from tenfold cross-validation is 0.9531 for SLKNN and 0.9373 for MLKNN. For $K$-means clustering, MBSM achieved the maximum accuracy of 0.6592 and the minimum entropy of 0.2426 among all similarity measures, on a scale of unity.
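The abstract does not give the MBSM formula, so it is not reproduced here. As a rough illustration of the evaluation setup it describes, the following sketch shows a $K$-nearest neighbors classifier that accepts any similarity function, together with two of the baseline measures (cosine and extended Jaccard) and one common definition of clustering entropy normalized to the range 0 to 1. All function names, the toy vectors, and the entropy normalization are assumptions made for illustration; the paper's exact definitions may differ.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here: an empty document matches nothing
    return dot / (norm_a * norm_b)


def extended_jaccard_similarity(a, b):
    """Extended (Tanimoto) Jaccard coefficient for real-valued vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    return dot / denom if denom > 0.0 else 0.0


def knn_predict(train_vectors, train_labels, query, k, similarity):
    """Single-label KNN: majority vote among the k most similar documents."""
    ranked = sorted(
        zip(train_vectors, train_labels),
        key=lambda pair: similarity(pair[0], query),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def clustering_entropy(cluster_ids, true_labels):
    """Size-weighted class entropy per cluster, divided by log(#classes).

    Assumed definition: 0 means every cluster is pure, 1 means every
    cluster mixes the classes uniformly.
    """
    classes = set(true_labels)
    if len(classes) < 2:
        return 0.0
    total = len(true_labels)
    members = {}
    for cid, label in zip(cluster_ids, true_labels):
        members.setdefault(cid, []).append(label)
    entropy = 0.0
    for labels in members.values():
        size = len(labels)
        h = -sum((n / size) * math.log(n / size) for n in Counter(labels).values())
        entropy += (size / total) * h
    return entropy / math.log(len(classes))


# Hypothetical toy corpus: four documents over a four-term vocabulary.
train_vectors = [[2, 1, 0, 0], [3, 0, 1, 0], [0, 0, 2, 3], [0, 1, 1, 2]]
train_labels = ["sports", "sports", "finance", "finance"]
query = [1, 1, 0, 0]

for measure in (cosine_similarity, extended_jaccard_similarity):
    print(measure.__name__, knn_predict(train_vectors, train_labels, query, 3, measure))
```

Passing the similarity function as a parameter means a new measure can be swapped in without touching the classifier, which is the kind of like-for-like comparison the abstract reports.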


Bibliographic details

Source
IEEE Transactions on Emerging Topics in Computational Intelligence, 2020, No. 2, pp. 180-200 (21 pages)
Authors
Venkatanareshbabu Kuppili; Mainak Biswas; Damodar Reddy Edla; K. J. Ravi Prasad; Jasjit S. Suri
Author affiliations

Department of Computer Science and Engineering National Institute of Technology Farmagudi India;

Department of Computer Science and Engineering National Institute of Technology Farmagudi India;

Department of Computer Science and Engineering National;

Original format: PDF
Language: English
Keywords
Biomedical measurement; Containers; Clustering algorithms; Big Data; Atmospheric measurements; Particle measurements; Entropy;


