首页> 外文学位 >Efficient algorithms for data mining with federated databases .
【24h】

Efficient algorithms for data mining with federated databases .

机译:联合数据库的高效数据挖掘算法。

获取原文
获取原文并翻译 | 示例

摘要

The Internet era and high speed networks have ushered in the capabilities to have ready access to large amounts of geographically distributed data. Individuals, businesses, and governments recognize the value of this available resource to those who can transform the data into information. These databases, though valuable as individual entities, become significantly more valuable when they function as parts of a federated database and their data can be aggregated for collective mining or computations. This requires data mining algorithms to shift their focus from working with single databases to efficiently working with federated databases. Any such methodology must address issues relating to security and privacy of the data, cost of communication among the database nodes and the overlap of attributes among the databases. Most of the algorithms designed for mining single databases are not easily adaptable for federated databases.; Some existing approaches for mining horizontally and vertically partitioned datasets work by first sampling local databases, and then transferring the distorted versions of samples to a centralized node. The goal is to add noise to the data in such a way that some desirable statistical properties of the data remain very close to their original values. Addition of noise helps protecting the data privacy but there is some loss of information and communication cost can be large. If the data is not distorted and secure multi-party protocols are used to communicate for joint computation, the protocols are exponential, and they result in extremely heavy network traffic. To minimize the network traffic, it is desirable to maximize the amount of local computations on each of the participating nodes. We present in this work a new perspective on these two competing demands. In our approach, scalable and secure computation can be achieved on a general vertical partitioning of an implicit global database among the explicit federated components. We use an implicit aggregation space with the necessary data privacy designed in this space. Our algorithms efficiently compute first and second order statistical quantities, and inter-tuple distance based clusterings for data in the federated databases. We demonstrate these by computing the covariance matrix and the k-nearest neighbor and other clusterings for databases.; The information exchanged between the nodes hosting the databases is minimized and consists of summaries derived from multiple tuples of the databases. Privacy of the individual data tuples is preserved and secure multi party computation protocols may be applied to secure the sharing of summaries. Our algorithms are more capable than the other research in this area in that they can handle the more naturally occurring vertical partitions of an implicit global database with an arbitrary overlap in their attribute sets.; Some pattern recognition techniques use Principal Components Analysis (requiring covariance matrices) to identify features in the data and use these features in either classifier building or clustering. Some other pattern recognition techniques, such as hierarchical clustering or K-nearest neighbor algorithm, require computation of inter-tuple or tuple-cluster distances. Our algorithms enable both of these types of algorithms to be efficiently and privately computed in the context of federated databases.
机译:互联网时代和高速网络已经引入了可以立即访问大量地理分布数据的功能。个人,企业和政府都认识到这种可用资源对那些可以将数据转换为信息的人的价值。这些数据库虽然作为单独的实体有价值,但是当它们充当联合数据库的一部分并且可以将其数据进行汇总以进行集体挖掘或计算时,其价值将大大提高。这要求数据挖掘算法将其重点从使用单个数据库转移到有效使用联合数据库。任何此类方法都必须解决与数据的安全性和保密性,数据库节点之间的通信成本以及数据库之间的属性重叠有关的问题。设计用于挖掘单个数据库的大多数算法都不容易适应联邦数据库。一些用于挖掘水平和垂直划分的数据集的现有方法是通过首先对本地数据库进行采样,然后将失真的样本版本传输到集中式节点来工作的。目的是以某种方式将噪声添加到数据中,以使数据的某些所需统计属性保持非常接近其原始值。噪声的添加有助于保护数据的隐私,但是会损失一些信息,并且通信成本可能很高。如果数据没有失真,并且使用安全的多方协议进行通信以进行联合计算,则该协议将是指数级的,并且会导致网络通信量极大。为了最小化网络流量,期望最大化每个参与节点上的本地计算量。我们在这项工作中提出了关于这两个竞争需求的新观点。在我们的方法中,可以在显式联合组件之间对隐式全局数据库进行常规的垂直分区,从而实现可伸缩和安全的计算。我们使用隐式聚合空间,并在此空间中设计必要的数据保密性。我们的算法有效地计算出一阶和二阶统计量,以及基于元组间距离的联合数据库中数据的聚类。我们通过计算协方差矩阵和k近邻和其他数据库聚类来证明这些。承载数据库的节点之间交换的信息被最小化,并包含从数据库的多个元组派生的摘要。保留了各个数据元组的隐私,并且可以应用安全的多方计算协议来保护摘要的共享。我们的算法比该领域的其他研究更有能力,因为它们可以处理在属性集中具有任意重叠的,更自然发生的隐式全局数据库的垂直分区。一些模式识别技术使用主成分分析(需要协方差矩阵)来识别数据中的特征,并在分类器构建或聚类中使用这些特征。其他一些模式识别技术,例如层次聚类或K最近邻算法,需要计算元组间或元组集群距离。我们的算法使这两种类型的算法都可以在联合数据库的上下文中高效地进行私有计算。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号