International Journal of Uncertainty, Fuzziness, and Knowledge-based Systems

RE-IDENTIFYING REGISTER DATA BY SURVEY DATA USING CLUSTER ANALYSIS: AN EMPIRICAL STUDY

Abstract

More and more empirical researchers from universities or research centres like to use register or survey data collected by statistical agencies or the social security system, since these data can be used for several empirical studies, e.g. the analysis of special groups or the quantitative effects of economic or social policies. Most of the data required have to be (factually) anonymised before they are disseminated, to preserve confidentiality. In the area of statistics on households and individuals, this path has been pursued in Germany for several years. The transmission of de facto anonymised data files has proved to be a good form of co-operation between scientists and statisticians. Factual anonymity of the data depends on the costs and benefits of a potential re-identification. The paper assumes that the intruder only accepts low costs. Therefore he uses a cluster analysis module that is available in a standard statistical software package to re-identify persons. After a description of the method, different factors influencing the re-identification risk are studied using German employment statistics (register data) and the German Life History Study (survey data). The factors are the sampling fraction and the number of (irrelevant) variables. The results show that the number of identifiable persons is remarkably high. Furthermore, the cluster analysis confirms that the number of re-identifiable records increases with increasing sampling fraction and that irrelevant variables reduce this number.
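
The abstract does not spell out the clustering procedure, so the following is only a minimal sketch of how a low-cost, cluster-based matching attempt could look. It assumes, purely for illustration, that each survey record serves as a fixed cluster centre, that every register record is assigned to its nearest centre after standardisation, and that a centre attracting exactly one register record counts as a claimed match. The function names, the two key variables (age, income) and the toy records are hypothetical and are not taken from the paper or its data.

```python
import math
from collections import defaultdict


def zscore_columns(rows):
    """Standardise each column to mean 0 and unit variance (constant columns map to 0)."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [
        math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
        for c, m in zip(cols, means)
    ]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]


def claimed_matches(register, survey):
    """Return {survey_index: register_index} for centres with exactly one register record.

    Assumption (not from the paper): survey records act as fixed cluster centres
    and register records are assigned to the nearest centre by Euclidean distance.
    """
    pooled = zscore_columns(register + survey)
    reg, centres = pooled[:len(register)], pooled[len(register):]
    members = defaultdict(list)
    for i, record in enumerate(reg):
        nearest = min(range(len(centres)), key=lambda j: math.dist(record, centres[j]))
        members[nearest].append(i)
    # A cluster holding a single register record is unique on the key variables.
    return {j: idx[0] for j, idx in members.items() if len(idx) == 1}


# Hypothetical toy data: (age, yearly income) as overlapping key variables.
register = [[25, 30000], [26, 31000], [40, 52000], [58, 20000]]
survey = [[41, 51000], [57, 21000], [25, 30500]]

matches = claimed_matches(register, survey)
print(matches)
```

In this toy example the third survey record yields no claimed match, because two similar register records fall into its cluster. Under the same assumptions, appending variables that carry no shared information to both files adds noise to the distances, which is one way to picture why irrelevant variables would lower the number of unique matches.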