K-repeating Substrings: a String-Algorithmic Approach to Privacy-Preserving Publishing of Textual Data

Abstract

De-identifying textual data is an important task for publishing and sharing the data among researchers while protecting the privacy of the individuals referenced therein. While supervised learning approaches have been applied successfully to this task in the clinical domain, existing methods are hard to transfer to other domains and languages because preparing the linguistic resources they require takes considerable cost and time. This paper presents an efficient unsupervised algorithm that detects all substrings occurring fewer than k times in the input string, based on the assumption that such rare sequences are likely to contain sensitive information, such as names of people and rare diseases, that may identify individuals. The proposed algorithm runs in time linear in the input size, both asymptotically and empirically, when k is a constant. Empirical evaluation on the i2b2 (Informatics for Integrating Biology and the Bedside) dataset shows the effectiveness of the algorithm in comparison to baselines that use simple word frequencies.
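
To make the notion of a substring occurring fewer than k times concrete, here is a minimal brute-force sketch in Python. It is not the linear-time algorithm of the paper, whose details are not given in this record. The function names and the `max_len` cap are assumptions introduced only for illustration. For each start position the sketch reports the shortest substring that occurs fewer than k times, since every longer substring starting at that position is then at least as rare.

```python
from collections import Counter


def count_substrings(text, max_len):
    """Count every substring of length 1..max_len (overlapping occurrences count)."""
    counts = Counter()
    n = len(text)
    for i in range(n):
        for length in range(1, min(max_len, n - i) + 1):
            counts[text[i:i + length]] += 1
    return counts


def shortest_rare_substrings(text, k, max_len=8):
    """Return (start, substring) pairs: the shortest substring at each start
    position that occurs fewer than k times, considering lengths up to max_len.

    Illustrative brute force only (hypothetical helper, roughly
    O(len(text) * max_len) dictionary updates), not the paper's method.
    """
    counts = count_substrings(text, max_len)
    n = len(text)
    result = []
    for i in range(n):
        for length in range(1, min(max_len, n - i) + 1):
            sub = text[i:i + length]
            if counts[sub] < k:
                result.append((i, sub))
                break  # longer substrings from i are rare as well
    return result


if __name__ == "__main__":
    # Toy input: "ab" repeats, while the tail "c" appears only once.
    print(shortest_rare_substrings("abababc", k=2, max_len=3))
```

On the toy input the script prints `[(4, 'abc'), (5, 'bc'), (6, 'c')]`: only substrings that reach the single `c` occur fewer than two times. In a de-identification setting, such spans would be the candidates for masking.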

Bibliographic details

Source
Pacific Asia Conference on Language, Information and Computation | 2015 | 10 pages
Authors
Yusuke Matsubara; Koiti Hasida
Original format: PDF
Chinese Library Classification: Computer networks
Date indexed: 2022-08-20 20:06:20
