Sentiment Classification Using Paragraph Vector and Cognitive Big Data Semantics on Apache Spark

Abstract

Apache Spark allows us to write a distributed version of any machine learning algorithm, which can then be scaled up easily to larger datasets on a cluster of commodity hardware. In this paper, we propose hybridizing paragraph vectors with distributed, parallel versions of six well-known machine learning techniques for sentiment analysis. We employed a distributed implementation of a neural network language model to obtain paragraph vectors for a given corpus. On the paragraph vectors so obtained, we applied a host of distributed classification algorithms available in Apache Spark to perform sentiment classification. We considered two baseline methods for comparison: a Bag-of-Words-based document-term matrix (DTM) and a hashing-trick-based DTM. We experimented with a movie review dataset of size 992 MB. Among the six classifiers employed, MLP turned out to be statistically the same as GBT and SVM, while it statistically significantly outperformed the rest of the classifiers, yielding an area under the ROC curve (AUC) of 95.44%.
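To make the two baseline representations concrete, here is a minimal single-machine sketch in plain Python contrasting a Bag-of-Words DTM with a hashing-trick DTM. It is not the authors' Spark implementation. The function names (`bow_dtm`, `hashed_dtm`), the whitespace tokenizer, the MD5 hash, and the feature dimension are all assumptions chosen for illustration, since the abstract does not specify them.

```python
import hashlib
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization; the abstract does not
    # describe the preprocessing actually used.
    return text.lower().split()


def bow_dtm(docs):
    """Bag-of-Words DTM: one column per distinct term seen in the corpus."""
    vocab = sorted({tok for d in docs for tok in tokenize(d)})
    index = {tok: i for i, tok in enumerate(vocab)}
    rows = []
    for d in docs:
        row = [0] * len(vocab)
        for tok, count in Counter(tokenize(d)).items():
            row[index[tok]] = count
        rows.append(row)
    return vocab, rows


def hashed_dtm(docs, num_features=16):
    """Hashing-trick DTM: fixed width, no vocabulary needs to be stored."""
    rows = []
    for d in docs:
        row = [0] * num_features
        for tok in tokenize(d):
            # A stable hash (MD5 here, purely illustrative) maps each term
            # to a column; distinct terms may collide in the same column.
            h = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16)
            row[h % num_features] += 1
        rows.append(row)
    return rows


docs = ["a great great movie", "a dull movie"]
vocab, bow = bow_dtm(docs)
hashed = hashed_dtm(docs, num_features=8)
```

The Bag-of-Words matrix grows with the vocabulary and requires a pass over the corpus to build the term index. The hashed matrix has a fixed width chosen up front, which suits distributed settings at the cost of possible collisions between terms.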
