
EVALUATING SAMPLING METHODS FOR REUSING KNOWLEDGE FROM LARGE AND ILL-STRUCTURED QUALITATIVE DATA SETS


Abstract

The desire to use ever-growing qualitative data sets of user-generated content in the engineering design process in a computationally effective manner makes it increasingly necessary to draw representative samples. This work investigated the ability of alternative sampling algorithms to draw samples that conform to the characteristics of the original data set. The sampling methods investigated were random sampling, interval sampling, fixed-increment (or systematic) sampling, and stratified sampling. Data collected through the Vehicle Owner's Questionnaire, a survey administered by the U.S. National Highway Traffic Safety Administration, are used as a case study throughout this paper. The paper demonstrates that existing statistical methods may be used to evaluate goodness of fit for samples drawn from large bodies of qualitative data. Evaluating goodness of fit not only provides confidence that a sample is representative of the data set from which it is drawn, but also yields valuable real-time feedback during the sampling process. This investigation revealed two interesting and counterintuitive trends in sampling algorithm performance. The first is that larger sample sizes do not necessarily lead to improved goodness of fit. The second is that, depending on the details of implementation, data cleansing may degrade the performance of data sampling algorithms rather than improve it. This work illustrates the importance of aligning sampling procedures with data structures and of validating that samples conform to the characteristics of the larger data set, to avoid drawing erroneous conclusions from unexpectedly biased samples of data.
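To make the comparison described in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It draws a random, a fixed-increment (systematic), and a stratified sample from a synthetic data set and scores each with a Pearson chi-square statistic. The abstract only says "existing statistical methods", so the choice of chi-square is an assumption, as are the `component` field, the category names, and all function names. Interval sampling is omitted because the abstract does not define how it differs from fixed-increment sampling.

```python
import random
from collections import Counter, defaultdict


def simple_random_sample(records, n, rng):
    """Draw n records uniformly at random, without replacement."""
    return rng.sample(records, n)


def systematic_sample(records, n, rng):
    """Take every k-th record after a random start (requires n <= len(records))."""
    k = len(records) // n
    start = rng.randrange(k)
    return [records[start + i * k] for i in range(n)]


def stratified_sample(records, n, key, rng):
    """Sample each stratum in proportion to its share of the full data set."""
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)
    sample = []
    for members in strata.values():
        # Rounding means the total can differ slightly from n in general.
        quota = round(n * len(members) / len(records))
        sample.extend(rng.sample(members, min(quota, len(members))))
    return sample


def chi_square_statistic(sample, population, key):
    """Pearson chi-square statistic: observed sample counts per category
    versus the counts expected from the population proportions."""
    pop_counts = Counter(key(r) for r in population)
    obs_counts = Counter(key(r) for r in sample)
    stat = 0.0
    for category, pop_count in pop_counts.items():
        expected = len(sample) * pop_count / len(population)
        stat += (obs_counts.get(category, 0) - expected) ** 2 / expected
    return stat


# Synthetic stand-in for a categorical field of a complaint data set
# (hypothetical categories and counts, not taken from the paper).
rng = random.Random(42)
population = [
    {"component": name}
    for name, count in (("brakes", 500), ("engine", 300), ("airbags", 200))
    for _ in range(count)
]
rng.shuffle(population)


def by_component(record):
    return record["component"]


for label, sample in (
    ("random", simple_random_sample(population, 100, rng)),
    ("systematic", systematic_sample(population, 100, rng)),
    ("stratified", stratified_sample(population, 100, by_component, rng)),
):
    stat = chi_square_statistic(sample, population, by_component)
    print(f"{label:>10}: n={len(sample)}, chi-square={stat:.3f}")
```

A smaller statistic means the sample's category mix is closer to that of the full data set. Recomputing it as the sample grows is one way to obtain the kind of feedback during sampling that the abstract mentions. Note that the systematic sampler depends on record order, which is one place where sampling procedure and data structure can interact.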
