Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Jul 23-26, 2002, Edmonton

Learning to Match and Cluster Large High-Dimensional Data Sets For Data Integration


Abstract

Part of the process of data integration is determining which sets of identifiers refer to the same real-world entities. In integrating databases found on the Web or obtained by using information extraction methods, it is often possible to solve this problem by exploiting similarities in the textual names used for objects in different databases. In this paper we describe techniques for clustering and matching identifier names that are both scalable and adaptive, in the sense that they can be trained to obtain better performance in a particular domain. An experimental evaluation on a number of sample datasets shows that the adaptive method sometimes performs much better than either of two non-adaptive baseline systems, and is nearly always competitive with the best baseline system.
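
As a rough illustration of matching identifiers through similarities in their textual names, the sketch below pairs names from two lists using token-level Jaccard similarity. This is an assumed stand-in, not the paper's method: the similarity measure is fixed rather than trained, it compares every pair and so does not address scalability, and the function names, the sample names, and the 0.5 threshold are all hypothetical.

```python
import re


def tokens(name):
    """Lowercase a name and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", name.lower()))


def jaccard(a, b):
    """Jaccard similarity between the token sets of two names (0.0 to 1.0)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def match_names(left, right, threshold=0.5):
    """Map each name in `left` to its most similar name in `right`.

    A pair is kept only if its similarity reaches `threshold`
    (a hypothetical cutoff chosen for illustration).
    """
    matches = {}
    for a in left:
        best = max(right, key=lambda b: jaccard(a, b), default=None)
        if best is not None and jaccard(a, best) >= threshold:
            matches[a] = best
    return matches


if __name__ == "__main__":
    db_a = ["Acme Corp.", "Globex Inc"]
    db_b = ["ACME Corp", "Initech LLC"]
    print(match_names(db_a, db_b))
```

An adaptive approach of the kind the abstract describes would replace the fixed `jaccard` score with a similarity function learned from labeled examples in the target domain.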
