Efficient algorithms for data mining with federated databases .

机译：联合数据库的高效数据挖掘算法。

获取原文

获取原文并翻译 | 示例

页面导航

摘要
著录项
相似文献
相关主题

摘要

The Internet era and high speed networks have ushered in the capabilities to have ready access to large amounts of geographically distributed data. Individuals, businesses, and governments recognize the value of this available resource to those who can transform the data into information. These databases, though valuable as individual entities, become significantly more valuable when they function as parts of a federated database and their data can be aggregated for collective mining or computations. This requires data mining algorithms to shift their focus from working with single databases to efficiently working with federated databases. Any such methodology must address issues relating to security and privacy of the data, cost of communication among the database nodes and the overlap of attributes among the databases. Most of the algorithms designed for mining single databases are not easily adaptable for federated databases.; Some existing approaches for mining horizontally and vertically partitioned datasets work by first sampling local databases, and then transferring the distorted versions of samples to a centralized node. The goal is to add noise to the data in such a way that some desirable statistical properties of the data remain very close to their original values. Addition of noise helps protecting the data privacy but there is some loss of information and communication cost can be large. If the data is not distorted and secure multi-party protocols are used to communicate for joint computation, the protocols are exponential, and they result in extremely heavy network traffic. To minimize the network traffic, it is desirable to maximize the amount of local computations on each of the participating nodes. We present in this work a new perspective on these two competing demands. In our approach, scalable and secure computation can be achieved on a general vertical partitioning of an implicit global database among the explicit federated components. We use an implicit aggregation space with the necessary data privacy designed in this space. Our algorithms efficiently compute first and second order statistical quantities, and inter-tuple distance based clusterings for data in the federated databases. We demonstrate these by computing the covariance matrix and the k-nearest neighbor and other clusterings for databases.; The information exchanged between the nodes hosting the databases is minimized and consists of summaries derived from multiple tuples of the databases. Privacy of the individual data tuples is preserved and secure multi party computation protocols may be applied to secure the sharing of summaries. Our algorithms are more capable than the other research in this area in that they can handle the more naturally occurring vertical partitions of an implicit global database with an arbitrary overlap in their attribute sets.; Some pattern recognition techniques use Principal Components Analysis (requiring covariance matrices) to identify features in the data and use these features in either classifier building or clustering. Some other pattern recognition techniques, such as hierarchical clustering or K-nearest neighbor algorithm, require computation of inter-tuple or tuple-cluster distances. Our algorithms enable both of these types of algorithms to be efficiently and privately computed in the context of federated databases.

机译：互联网时代和高速网络已经引入了可以立即访问大量地理分布数据的功能。个人，企业和政府都认识到这种可用资源对那些可以将数据转换为信息的人的价值。这些数据库虽然作为单独的实体有价值，但是当它们充当联合数据库的一部分并且可以将其数据进行汇总以进行集体挖掘或计算时，其价值将大大提高。这要求数据挖掘算法将其重点从使用单个数据库转移到有效使用联合数据库。任何此类方法都必须解决与数据的安全性和保密性，数据库节点之间的通信成本以及数据库之间的属性重叠有关的问题。设计用于挖掘单个数据库的大多数算法都不容易适应联邦数据库。一些用于挖掘水平和垂直划分的数据集的现有方法是通过首先对本地数据库进行采样，然后将失真的样本版本传输到集中式节点来工作的。目的是以某种方式将噪声添加到数据中，以使数据的某些所需统计属性保持非常接近其原始值。噪声的添加有助于保护数据的隐私，但是会损失一些信息，并且通信成本可能很高。如果数据没有失真，并且使用安全的多方协议进行通信以进行联合计算，则该协议将是指数级的，并且会导致网络通信量极大。为了最小化网络流量，期望最大化每个参与节点上的本地计算量。我们在这项工作中提出了关于这两个竞争需求的新观点。在我们的方法中，可以在显式联合组件之间对隐式全局数据库进行常规的垂直分区，从而实现可伸缩和安全的计算。我们使用隐式聚合空间，并在此空间中设计必要的数据保密性。我们的算法有效地计算出一阶和二阶统计量，以及基于元组间距离的联合数据库中数据的聚类。我们通过计算协方差矩阵和k近邻和其他数据库聚类来证明这些。承载数据库的节点之间交换的信息被最小化，并包含从数据库的多个元组派生的摘要。保留了各个数据元组的隐私，并且可以应用安全的多方计算协议来保护摘要的共享。我们的算法比该领域的其他研究更有能力，因为它们可以处理在属性集中具有任意重叠的，更自然发生的隐式全局数据库的垂直分区。一些模式识别技术使用主成分分析（需要协方差矩阵）来识别数据中的特征，并在分类器构建或聚类中使用这些特征。其他一些模式识别技术，例如层次聚类或K最近邻算法，需要计算元组间或元组集群距离。我们的算法使这两种类型的算法都可以在联合数据库的上下文中高效地进行私有计算。

著录项

作者
Young, Barrington R. St. A.;
展开▼
作者单位

University of Cincinnati.;

展开▼
授予单位 University of Cincinnati.;
学科 Computer Science.
学位 Ph.D.
年度 2007
页码 135 p.
总页数 135
原文格式 PDF
正文语种 eng
中图分类自动化技术、计算机技术;
关键词

相似文献

外文文献
中文文献
专利

1. An efficient algorithm for mining high utility patterns from incremental databases with one database scan [J] . Yun Unil, Ryang Heungmo, Lee Gangin, Knowledge-Based Systems . 2017,第MAY15期

机译：一种有效的算法，可通过一次数据库扫描从增量数据库中挖掘高实用性模式
2. Spatial Data Mining: Database Primitives, Algorithms and Efficient DBMS Support [J] . Martin Ester, Alexander Frommelt, Hans-Peter Kriegel, Data mining and knowledge discovery . 2000,第2a3期

机译：空间数据挖掘：数据库原语，算法和有效的DBMS支持
3. Spatial Data Mining: Database Primitives, Algorithms and Efficient DBMS Support [J] . Martin Ester, Alexander Frommelt, Hans-Peter Kriegel, Data mining and knowledge discovery . 2000,第2a3期

机译：空间数据挖掘：数据库原语，算法和有效的DBMS支持
4. Efficient Algorithms for Mining Frequent Weighted Itemsets from Weighted Items Databases [C] . Le Bac, Nguyen Huy, Vo Bay 2010 IEEE RIVF International Conference on Computing and Communication Technologies, Research, Innovation, and Vision for the Future . 2010

机译：从加权项目数据库中挖掘频繁加权项目集的高效算法
5. Efficient sequential and parallel algorithms for mining association rules in text databases [D] . Holt, John D. 2003

机译：用于挖掘文本数据库中关联规则的高效顺序和并行算法
6. Forecasting and optimizing Agrobacterium-mediated genetic transformation via ensemble model- fruit fly optimization algorithm: A data mining approach using chrysanthemum databases [O] . Mohsen Hesami, Milad Alizadeh, Roohangiz Naderi, 2020

机译：通过集合模型飞行优化算法预测和优化农杆菌介导的遗传转化：使用菊花数据库的数据挖掘方法
7. Enhancing association rules algorithms for mining distributed databases. Integration of fast BitTable and multi-agent association rules mining in distributed medical databases for decision support. [O] . Abdo Walid Adly Atteya 2012

机译：增强用于挖掘分布式数据库的关联规则算法。快速BitTable和多代理关联规则挖掘在分布式医疗数据库中的集成，以提供决策支持。

Efficient algorithms for data mining with federated databases .

摘要

著录项

相似文献

相关主题

期刊订阅