Conference: Privacy in Statistical Databases

Generating Useful Test Data for Complex Linked Employer-Employee Datasets

Abstract

When data access for external researchers is difficult or time-consuming, it can be beneficial to disseminate test datasets that mimic the structure of the original data in advance. With these test data, researchers can develop their analysis code or decide whether the data are suitable for their planned research before they go through the lengthy process of getting access at the research data center. The aim of these data is not to provide any meaningful results. Instead, it is important to maintain the structure of the data as closely as possible, including skip patterns, logical constraints between the variables, and longitudinal relationships, so that any code developed using these test data will also run on the original data without further modification. Achieving this goal can be challenging for complex datasets such as linked employer-employee datasets (LEED), where the links between the establishments and the employees also need to be maintained. Using the LEED of the Institute for Employment Research, we illustrate how useful test data can be developed for such complex datasets. Our approach mainly relies on traditional statistical disclosure control (SDC) techniques, such as data swapping and noise addition, for data protection. Since statistical inferences need not be preserved, high swapping rates can be applied to sufficiently protect the data. At the same time, it is straightforward to maintain the structure of the data by adding some constraints to the swapping procedure.
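
The idea of swapping under constraints, followed by noise addition, can be illustrated with a minimal Python sketch. This is not the authors' procedure: the record layout, the variable names (`person_id`, `establishment_id`, `employed`, `wage`), the helper functions, and the choice to shuffle every value within a group (a 100% swapping rate) are all assumptions made for illustration. Wages are exchanged only among records with the same `employed` status, so the skip pattern (no wage for non-employed persons) survives, and the establishment links are left untouched.

```python
import random
from collections import defaultdict


def constrained_swap(records, swap_var, constraint_vars, rng):
    """Shuffle `swap_var` among records that agree on all `constraint_vars`.

    Hypothetical helper: identifiers and link variables are never touched,
    so employer-employee links stay exactly as in the input.
    """
    groups = defaultdict(list)
    for idx, rec in enumerate(records):
        key = tuple(rec[v] for v in constraint_vars)
        groups[key].append(idx)

    out = [dict(rec) for rec in records]  # copy, leave the input unchanged
    for indices in groups.values():
        values = [records[i][swap_var] for i in indices]
        rng.shuffle(values)  # swap values only inside the group
        for i, value in zip(indices, values):
            out[i][swap_var] = value
    return out


def add_noise(records, var, rel_sd, rng):
    """Multiply each non-missing value of `var` by (1 + Gaussian noise)."""
    out = [dict(rec) for rec in records]
    for rec in out:
        if rec[var] is not None:  # respect the skip pattern
            noisy = rec[var] * (1 + rng.gauss(0, rel_sd))
            rec[var] = round(max(0.0, noisy), 2)  # keep wages non-negative
    return out


# Toy data, invented for this sketch (assumption, not from the paper).
records = [
    {"person_id": 1, "establishment_id": "A", "employed": True, "wage": 3100.0},
    {"person_id": 2, "establishment_id": "A", "employed": True, "wage": 2700.0},
    {"person_id": 3, "establishment_id": "B", "employed": True, "wage": 4200.0},
    {"person_id": 4, "establishment_id": "B", "employed": False, "wage": None},
    {"person_id": 5, "establishment_id": "B", "employed": True, "wage": 1900.0},
    {"person_id": 6, "establishment_id": "A", "employed": False, "wage": None},
]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
swapped = constrained_swap(records, "wage", ["employed"], rng)
test_data = add_noise(swapped, "wage", 0.1, rng)

for rec in test_data:
    print(rec)
```

Because the constraint variables define the groups within which values may move, adding further logical constraints (for example, longitudinal consistency) amounts to extending `constraint_vars` in this sketch. Analysis code that reads `establishment_id` or tests `wage is None` behaves the same on the toy test data as on the toy original.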