Multivariate understanding of income and expenditure in United States households with statistical learning

Hu Mingzhao

Abstract

In recent decades, data-driven approaches have been developed to analyze demographic and economic surveys on a large scale. Despite advances in multivariate techniques and learning methods, in practice analysis and interpretation often focus on a small portion of the available data and are limited to a single perspective. This paper applies a selected array of multivariate statistical learning methods to the income and expenditure patterns of households in the United States, using the Public-Use Microdata from the Bureau of Labor Statistics Consumer Expenditure Survey (CE). The objective is to propose an effective data pipeline that provides visualizations and comprehensive interpretations for applications in governmental regulation and economic research, using thirty-five original survey variables covering the categories of demographics, income and expenditure. Details on feature extraction not only showcase CE as a unique, publicly shared big data resource with high potential for in-depth analysis, but also assist interested researchers with pre-processing. Challenges from missing values and categorical variables are treated in the exploratory analysis, while statistical learning methods are employed to address multiple economic perspectives. Principal component analysis identifies after-tax income, wage/salary income, and quarterly expenditure on food, on housing, and overall as the five most important of the selected variables, while cluster analysis identifies and visualizes the implicit structure among the variables. Building on these results, canonical correlation analysis reveals a high correlation between two selected groups of variables, one of income and the other of expenditure.
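As a rough illustration of two of the methods named in the abstract, the sketch below runs principal component analysis and canonical correlation analysis on synthetic data using NumPy alone. It is not the paper's pipeline: the data are randomly generated stand-ins rather than CE microdata, the column groupings `income` and `expenditure` and the helper names `standardize`, `pca` and `canonical_correlations` are hypothetical, and cluster analysis and missing-value handling are left out.

```python
import numpy as np


def standardize(X):
    """Center each column and scale it to unit sample variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)


def pca(X):
    """PCA on standardized data via SVD.

    Returns the explained-variance ratio of each component and the
    loading matrix (one column per component).
    """
    Z = standardize(X)
    _, s, Vt = np.linalg.svd(Z, full_matrices=False)
    var = s ** 2 / (Z.shape[0] - 1)
    return var / var.sum(), Vt.T


def canonical_correlations(X, Y):
    """Canonical correlations between two variable groups.

    Uses the QR-then-SVD approach: the singular values of Qx^T Qy are
    the canonical correlations (assumes both groups have full column rank).
    """
    Qx, _ = np.linalg.qr(X - X.mean(axis=0))
    Qy, _ = np.linalg.qr(Y - Y.mean(axis=0))
    s = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return np.clip(s, 0.0, 1.0)


# Synthetic stand-in data (NOT CE survey data): a shared latent factor
# drives both the "income" and the "expenditure" columns.
rng = np.random.default_rng(0)
n = 500
latent = rng.normal(size=n)
income = np.column_stack([
    latent + 0.3 * rng.normal(size=n),
    latent + 0.5 * rng.normal(size=n),
])
expenditure = np.column_stack([
    latent + 0.4 * rng.normal(size=n),
    0.5 * latent + rng.normal(size=n),
    rng.normal(size=n),
])

ratios, loadings = pca(np.hstack([income, expenditure]))
rho = canonical_correlations(income, expenditure)

print("explained variance ratios:", np.round(ratios, 3))
print("canonical correlations:", np.round(rho, 3))
```

In a sketch like this, variables with large absolute loadings on the leading components are the ones a PCA-based ranking would flag as important, and a first canonical correlation close to 1 indicates that some linear combination of the income columns tracks a linear combination of the expenditure columns closely.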

