
Towards privacy preserving unstructured big data publishing


Abstract

A wide range of sources and sophisticated tools are used to gather and process large volumes of data (big data), which can lead to privacy disclosure, at a broader or finer level, for the data owner. Privacy-preserving data publishing approaches such as k-anonymity, l-diversity, and t-closeness are widely used to de-identify data. However, the risk of re-identification always remains, because data is collected from multiple sources, such as the public web, social media, Internet whereabouts, and sensors, that are highly prone to data linkage. In the literature, k-anonymity stands out as one of the most popular mainstream data anonymization approaches, and it can also be applied to large datasets. However, applying k-anonymization to a variety of data types (especially unstructured data) is difficult in the traditional way, because it requires the data to be classified into personal data, quasi-identifiers, and sensitive data. We identify existing approaches from the Natural Language Processing (NLP) literature for converting unstructured data into structured form, so that k-anonymization can be applied to the generated structured records. We adopt a two-phase Conditional Random Field (CRF) based Named Entity Recognition (NER) approach to represent unstructured data in structured form. Further, we propose Improved Scalable k-Anonymization (ImSKA) to anonymize the resulting structured representation, thereby achieving privacy-preserving publishing of unstructured big data. We compare both proposed approaches, NER and ImSKA, with existing approaches; the results show that our solutions outperform them in terms of F1 score and Normalized Cardinality Penalty (NCP), respectively. Since NER approaches are widely used for biomedical datasets, we also use a well-known Bio-NER dataset, the GENIA corpus, to measure performance.
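
To make the k-anonymity requirement concrete, the following is a minimal Python sketch, not the ImSKA algorithm from the paper. It assumes the NER step has already produced structured records; the field names, the sample values, the fixed generalization rule, and the simplified interval-width penalty used for the age attribute are all hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical structured records, as an NER step might produce them.
# Field names and values are illustrative only, not taken from the paper.
records = [
    {"age": 34, "zip": "11010", "disease": "flu"},
    {"age": 36, "zip": "11012", "disease": "asthma"},
    {"age": 52, "zip": "11080", "disease": "flu"},
    {"age": 58, "zip": "11085", "disease": "diabetes"},
]
QUASI_IDENTIFIERS = ("age", "zip")  # assumed classification of attributes
AGE_DOMAIN = (0, 100)               # assumed domain of the age attribute


def generalize(record):
    """Coarsen quasi-identifiers: age to a 20-year interval, zip to a 3-digit prefix."""
    low = (record["age"] // 20) * 20
    return {
        "age": (low, low + 19),
        "zip": record["zip"][:3] + "**",
        "disease": record["disease"],  # sensitive attribute is left unchanged
    }


def is_k_anonymous(rows, k):
    """Return True if every quasi-identifier combination occurs at least k times."""
    counts = Counter(tuple(r[q] for q in QUASI_IDENTIFIERS) for r in rows)
    return all(c >= k for c in counts.values())


def age_penalty(rows):
    """Average generalized age-interval width, normalized by the domain width."""
    span = AGE_DOMAIN[1] - AGE_DOMAIN[0]
    return sum((hi - lo) / span for lo, hi in (r["age"] for r in rows)) / len(rows)


anonymized = [generalize(r) for r in records]
print(is_k_anonymous(records, 2), is_k_anonymous(anonymized, 2))
print(age_penalty(anonymized))
```

The raw records are all unique on their quasi-identifiers, so they fail the check for k = 2, while the generalized records form two groups of two. The penalty function gives a rough sense of the information lost to generalization: a wider interval means a higher penalty.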
