Journal: Information Systems Research

Correcting Misclassification Bias in Regression Models with Variables Generated via Data Mining



Abstract

Advances in data mining have led a growing number of empirical studies in the social sciences to apply classification algorithms to construct independent or dependent variables for further analysis with standard regression methods. In the classification phase of these studies, the standard procedure requires researchers to choose, subjectively, a classification performance metric to optimize. Whichever metric is chosen, the constructed variable still contains classification error, because the variable cannot be classified perfectly. Misclassification of the constructed variable leads to inconsistent regression coefficient estimates in the subsequent regression phase, a problem documented in the econometrics literature as measurement error. Earlier work has offered pioneering discussions of the estimation inconsistency caused by misclassification in these studies. Our study systematically investigates the theoretical foundation of this problem when a newly constructed variable is used as the independent or dependent variable in linear and nonlinear regressions. Our theoretical analysis shows that consistent regression estimators can be recovered in all models studied in this paper. The main implication is that researchers do not need to tune the classification algorithm to minimize the inconsistency of the estimated regression coefficients, because the inconsistency can be corrected by theoretical formulas even when classification accuracy is poor. Instead, we propose tuning the classification algorithm to minimize the standard error of the focal regression coefficient derived from the corrected formula. Researchers can thereby obtain an estimator that is both consistent and the most precise in all models studied in this paper.
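The abstract does not state the correction formulas, so the following is only an illustrative sketch of the general idea, not the paper's method. It assumes the simplest case: a linear regression whose single binary regressor is replaced by a classifier's prediction with non-differential error. Under that assumption the naive slope is attenuated by the factor PPV + NPV - 1 (a textbook result for a misclassified binary regressor), and dividing by that factor recovers the slope. The function names, the simulated parameters, and the use of a labeled validation subset to estimate PPV and NPV are all hypothetical choices made for this example.

```python
import numpy as np


def naive_and_corrected_slope(y, w, ppv, npv):
    """Return (naive, corrected) slope of y on a misclassified binary regressor w.

    Assumes non-differential misclassification, so that
    naive slope = true slope * (PPV + NPV - 1).
    """
    # For a binary regressor, the OLS slope equals the difference in group means.
    naive = y[w == 1].mean() - y[w == 0].mean()
    attenuation = ppv + npv - 1.0
    return naive, naive / attenuation


def simulate(n=200_000, p=0.4, fp=0.2, fn=0.3, b0=1.0, b1=2.0, seed=0):
    """Simulate y = b0 + b1*x + noise, observing only a noisy label w for x."""
    rng = np.random.default_rng(seed)
    x = rng.random(n) < p  # true binary variable
    # Flip true positives with probability fn, true negatives with probability fp.
    flip = np.where(x, rng.random(n) < fn, rng.random(n) < fp)
    w = np.where(flip, ~x, x).astype(int)  # classifier output
    y = b0 + b1 * x + rng.normal(size=n)

    # Hypothetical labeled validation subset used to estimate PPV and NPV.
    v = slice(0, 20_000)
    ppv = x[v][w[v] == 1].mean()        # P(x = 1 | w = 1)
    npv = 1.0 - x[v][w[v] == 0].mean()  # P(x = 0 | w = 0)
    return naive_and_corrected_slope(y, w, ppv, npv)


if __name__ == "__main__":
    naive, corrected = simulate()
    print(f"naive slope: {naive:.3f}, corrected slope: {corrected:.3f}")
```

With these simulated error rates the attenuation factor is 0.5, so the naive slope should land near half of the true slope of 2.0 and the corrected slope near 2.0, even though the classifier is far from accurate.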
