Multimedia Tools and Applications

Resampling imbalanced data to detect fake reviews using machine learning classifiers and textual-based features



Abstract

Fraudulent online sellers often collude with reviewers to garner fake reviews for their products. This practice undermines buyers' trust in product reviews and potentially reduces the effectiveness of online markets. Being able to accurately detect fake reviews is therefore critical. In this study, we investigate several preprocessing and textual-based featuring methods along with machine learning classifiers, including single and ensemble models, to build a fake review detection system. Given the nature of product review data, where the number of fake reviews is far smaller than that of genuine reviews, we examine the results for each class in detail in addition to the overall results. Our preliminary analysis shows that, owing to the imbalanced data, there is a large gap between the per-class accuracies (e.g., 1.3% for the fake review class and 99.7% for the genuine review class), even though the overall accuracy looks promising (around 89.7%). We propose two dynamic random sampling techniques, applicable to textual-based featuring methods, to address this class imbalance problem. Our results indicate that both sampling techniques improve the accuracy of the fake review class: for balanced datasets, the accuracies improve to a maximum of 84.5% and 75.6% for random under- and over-sampling, respectively. However, the accuracies for genuine reviews decrease to 75% and 58.8% for random under- and over-sampling, respectively. We also find that, for smaller datasets, the Adaptive Boosting ensemble model outperforms the single classifiers, whereas, for larger datasets, the performance improvement from ensemble models is insignificant compared to the best results obtained by single classifiers.
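
To make the resampling idea concrete, the sketch below balances a toy review corpus with plain random under- or over-sampling of TF-IDF features before training an Adaptive Boosting classifier and reporting per-class metrics. This is only a minimal illustration under assumed tooling (scikit-learn and NumPy) with placeholder data; it does not reproduce the paper's dynamic sampling techniques, datasets, or reported accuracies.

    # Minimal sketch, not the paper's pipeline: TF-IDF features,
    # simple random under-/over-sampling, AdaBoost classifier.
    import numpy as np
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split

    def random_resample(X, y, mode="under", seed=0):
        # Balance a binary training set by randomly dropping majority rows
        # ("under") or randomly duplicating minority rows ("over").
        rng = np.random.default_rng(seed)
        fake_idx = np.flatnonzero(y == 1)      # minority class (fake reviews)
        genuine_idx = np.flatnonzero(y == 0)   # majority class (genuine reviews)
        if mode == "under":
            genuine_idx = rng.choice(genuine_idx, size=fake_idx.size, replace=False)
        else:
            fake_idx = rng.choice(fake_idx, size=genuine_idx.size, replace=True)
        keep = np.concatenate([fake_idx, genuine_idx])
        rng.shuffle(keep)
        return X[keep], y[keep]

    # Placeholder corpus: 90 genuine and 10 fake reviews, mimicking class imbalance.
    texts = (["arrived on time and works as described"] * 90
             + ["best product ever!!! five stars, buy now"] * 10)
    labels = np.array([0] * 90 + [1] * 10)

    train_txt, test_txt, y_train, y_test = train_test_split(
        texts, labels, test_size=0.2, stratify=labels, random_state=42)

    vectorizer = TfidfVectorizer(ngram_range=(1, 2))   # word n-gram textual features
    X_train = vectorizer.fit_transform(train_txt)
    X_test = vectorizer.transform(test_txt)

    X_bal, y_bal = random_resample(X_train, y_train, mode="under")
    model = AdaBoostClassifier(n_estimators=100, random_state=42).fit(X_bal, y_bal)
    print(classification_report(y_test, model.predict(X_test),
                                target_names=["genuine", "fake"]))

Passing mode="over" duplicates minority rows instead of discarding majority rows. In the study's experiments, the analogous choice trades fake-review accuracy (at most 84.5% for under-sampling versus 75.6% for over-sampling) against genuine-review accuracy (75% versus 58.8%).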
