Machine learning classification of entrepreneurs in British historical census data

Piero Montebruno; Robert J. Bennett; Harry Smith; Carry van Lieshout




Abstract

This paper presents a binary classification of entrepreneurs in British historical data, made possible by the recent availability of big data from the I-CeM dataset. The main task of the paper is to attribute an employment status to individuals who did not fully report entrepreneur status in the earlier censuses (1851-1881). The paper assesses the accuracy of different classifiers and machine learning algorithms, including Deep Learning, for this classification problem. We first adopt a ground-truth dataset from the later censuses to train a Logistic Regression (which is standard in the literature for this kind of binary classification) to distinguish entrepreneurs from non-entrepreneurs (i.e. workers). The accuracy of this baseline method is 0.74. We compare the Logistic Regression with ten optimized machine learning algorithms: Nearest Neighbors, Linear and Radial Support Vector Machine, Gaussian Process, Decision Tree, Random Forest, Neural Network, AdaBoost, Naive Bayes, and Quadratic Discriminant Analysis. The best results come from boosting and ensemble methods: AdaBoost achieves an accuracy of 0.95. Deep Learning, as a standalone category of algorithms, further improves accuracy to 0.96 without using the rich text data of the OccString feature, a string of up to 500 characters containing the full occupational statement of each individual collected in the earlier censuses. Finally, now using this OccString feature, we implement both shallow learning (a bag-of-words algorithm) and Deep Learning (a Recurrent Neural Network with a Long Short-Term Memory layer). These methods all achieve accuracies above 0.99, and the Deep Learning Recurrent Neural Network is the best model with an accuracy of 0.9978. The results show that the standard algorithms for classification can be outperformed by machine learning algorithms. This confirms the value of extending the techniques traditionally used in the literature for this type of classification problem.
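The shallow bag-of-words approach mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the abstract does not say which classifier sits on top of the bag-of-words features, so a logistic regression trained by stochastic gradient descent is assumed here. The occupational strings, the labels, and all function names are invented for illustration and are not taken from I-CeM.

```python
import math
from collections import Counter

# Hypothetical toy records (occupational string, label); 1 = entrepreneur, 0 = worker.
# These strings are invented for illustration and do not come from I-CeM.
TRAIN = [
    ("farmer of 200 acres employing 6 men", 1),
    ("master carpenter employing 3 men and 1 boy", 1),
    ("grocer employing 2 assistants", 1),
    ("builder employing 12 men", 1),
    ("agricultural labourer", 0),
    ("carpenter journeyman", 0),
    ("grocers assistant", 0),
    ("bricklayers labourer", 0),
]

def tokenize(occ_string):
    """Lower-case the string and split it on whitespace."""
    return occ_string.lower().split()

def build_vocab(records):
    """Map every token seen in training to a column index."""
    vocab = {}
    for text, _ in records:
        for tok in tokenize(text):
            vocab.setdefault(tok, len(vocab))
    return vocab

def vectorize(text, vocab):
    """Sparse bag-of-words vector {column index: count}; unknown tokens are dropped."""
    counts = Counter(tok for tok in tokenize(text) if tok in vocab)
    return {vocab[tok]: float(n) for tok, n in counts.items()}

def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, weights, bias):
    """Estimated probability that the record is an entrepreneur."""
    return sigmoid(bias + sum(weights[i] * v for i, v in x.items()))

def train(records, vocab, epochs=1000, lr=0.2):
    """Logistic regression fitted by plain stochastic gradient descent."""
    weights = [0.0] * len(vocab)
    bias = 0.0
    data = [(vectorize(t, vocab), y) for t, y in records]
    for _ in range(epochs):
        for x, y in data:
            err = predict_proba(x, weights, bias) - y  # gradient of the log-loss
            for i, v in x.items():
                weights[i] -= lr * err * v
            bias -= lr * err
    return weights, bias

def accuracy(records, vocab, weights, bias):
    """Share of records whose thresholded prediction matches the label."""
    hits = sum(
        (predict_proba(vectorize(t, vocab), weights, bias) >= 0.5) == bool(y)
        for t, y in records
    )
    return hits / len(records)

vocab = build_vocab(TRAIN)
weights, bias = train(TRAIN, vocab)
train_acc = accuracy(TRAIN, vocab, weights, bias)
```

An unseen statement can be scored with `predict_proba(vectorize("baker employing 4 men", vocab), weights, bias)`; tokens outside the training vocabulary are ignored. Accuracy on the training records says nothing about held-out performance, so figures such as those reported in the abstract would have to be measured on a separate test set.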


Bibliographic details

Source
Information Processing & Management | 2020, Issue 3 | 102210.1-102210.22 | 22 pages
Authors
Piero Montebruno; Robert J. Bennett; Harry Smith; Carry van Lieshout
Author affiliation

Department of Geography and Cambridge Group for the History of Population and Social Structure, University of Cambridge, Downing Place, Cambridge CB2 3EN, UK
Original format: PDF
Language: English
Keywords
Machine learning; Deep learning; Logistic regression; Classification; Big data; Census;


