Learning to Match and Cluster Large High-Dimensional Data Sets For Data Integration


Abstract

Part of the process of data integration is determining which sets of identifiers refer to the same real-world entities. In integrating databases found on the Web or obtained by using information extraction methods, it is often possible to solve this problem by exploiting similarities in the textual names used for objects in different databases. In this paper we describe techniques for clustering and matching identifier names that are both scalable and adaptive, in the sense that they can be trained to obtain better performance in a particular domain. An experimental evaluation on a number of sample datasets shows that the adaptive method sometimes performs much better than either of two non-adaptive baseline systems, and is nearly always competitive with the best baseline system.
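
The abstract does not spell out the matching and clustering techniques, so the following is only a minimal illustrative sketch of the general idea of grouping identifier names by textual similarity. It is not the authors' method. The function names (`tokens`, `jaccard`, `cluster_names`), the use of Jaccard token overlap, and the fixed `threshold` are all assumptions made for illustration. The fixed threshold makes this sketch non-adaptive, and the all-pairs comparison is quadratic, so it is not scalable in the sense the paper targets.

```python
from itertools import combinations


def tokens(name):
    """Lowercase a name, drop commas and periods, and return its set of word tokens."""
    cleaned = name.lower().replace(",", " ").replace(".", " ")
    return frozenset(cleaned.split())


def jaccard(a, b):
    """Jaccard similarity between the token sets of two names (hypothetical measure)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def cluster_names(names, threshold=0.5):
    """Single-link grouping: names end up together if a chain of similar pairs links them."""
    parent = {n: n for n in names}

    def find(x):
        # Follow parent pointers to the cluster representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Compare every pair of names; this is O(n^2) and only suitable for small inputs.
    for a, b in combinations(names, 2):
        if jaccard(a, b) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for n in names:
        groups.setdefault(find(n), set()).add(n)
    return list(groups.values())


if __name__ == "__main__":
    sample = ["William W. Cohen", "Cohen, William W.", "Jacob Richman"]
    for group in cluster_names(sample):
        print(sorted(group))
```

An adaptive variant would learn the similarity function or the threshold from labeled matching pairs in the target domain instead of fixing them by hand.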


Bibliographic details

Source
Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Jul 23-26, 2002, Edmonton | 2002 | pp. 475-480 | 6 pages
Authors
William W. Cohen; Jacob Richman
Original format: PDF
Chinese Library Classification: Automation technology, computer technology
Keywords
learning; clustering; text mining; large datasets


