Journal: Information Processing & Management

Machine learning classification of entrepreneurs in British historical census data


Abstract

This paper presents a binary classification of entrepreneurs in British historical census data, made possible by the recent availability of big data from the I-CeM dataset. The main task is to attribute an employment status to individuals who did not fully report entrepreneur status in the earlier censuses (1851-1881). The paper assesses the accuracy of different classifiers and machine learning algorithms, including Deep Learning, on this classification problem. We first use a ground-truth dataset from the later censuses to train a Logistic Regression (the standard method in the literature for this kind of binary classification) to distinguish entrepreneurs from non-entrepreneurs (i.e. workers). The accuracy of this baseline method is 0.74. We then compare the Logistic Regression with ten optimized machine learning algorithms: Nearest Neighbors, Linear and Radial Support Vector Machine, Gaussian Process, Decision Tree, Random Forest, Neural Network, AdaBoost, Naive Bayes, and Quadratic Discriminant Analysis. The best results come from boosting and ensemble methods; AdaBoost achieves an accuracy of 0.95. Deep Learning, as a standalone category of algorithms, further improves accuracy to 0.96 without using the rich text data of the OccString feature, a string of up to 500 characters containing the full occupational statement of each individual collected in the earlier censuses. Finally, now using the OccString feature, we implement both shallow learning (a bag-of-words algorithm) and Deep Learning (a Recurrent Neural Network with a Long Short-Term Memory layer). These methods all achieve accuracies above 0.99, with the Deep Learning Recurrent Neural Network the best model at an accuracy of 0.9978. The results show that the standard classification algorithms can be outperformed by machine learning algorithms, which confirms the value of extending the techniques traditionally used in the literature for this type of classification problem.
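To make the shallow bag-of-words idea concrete, the following is a minimal sketch in standard-library Python, not the authors' implementation. The abstract does not say which classifier sits on top of the bag-of-words features, so a multinomial Naive Bayes with add-one smoothing is assumed here purely for illustration. The occupational strings, their labels, and the names `TRAIN`, `tokenize`, `train_naive_bayes`, `predict`, and `accuracy` are all hypothetical and are not drawn from I-CeM.

```python
import math
import re
from collections import Counter

# Hypothetical occupational statements with made-up labels
# (1 = entrepreneur, 0 = worker). These are NOT I-CeM records.
TRAIN = [
    ("farmer of 200 acres employing 6 men", 1),
    ("master baker employing 2 men and 1 boy", 1),
    ("grocer employing 3 assistants", 1),
    ("builder master employing 12 men", 1),
    ("agricultural labourer", 0),
    ("baker journeyman", 0),
    ("grocer's assistant", 0),
    ("bricklayer labourer", 0),
]


def tokenize(text):
    """Lower-case the string and keep alphabetic word tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def train_naive_bayes(samples):
    """Collect per-class word counts (the bag-of-words representation)."""
    word_counts = {0: Counter(), 1: Counter()}
    class_counts = Counter()
    for text, label in samples:
        class_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts[0]) | set(word_counts[1])
    return word_counts, class_counts, vocab


def predict(model, text):
    """Return the class with the higher smoothed log-probability."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        score = math.log(class_counts[label] / total)  # class prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokenize(text):
            if tok in vocab:  # words unseen in training are ignored
                score += math.log((word_counts[label][tok] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)


def accuracy(model, samples):
    """Share of samples whose predicted label matches the given label."""
    return sum(predict(model, t) == y for t, y in samples) / len(samples)


model = train_naive_bayes(TRAIN)
```

Calling `predict(model, "draper employing 4 men")` on an unseen statement shows how words such as "employing" carry the signal in this toy setting. The accuracies reported in the abstract come from the authors' own models and data and cannot be reproduced from a sketch of this size.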
