Conference: ACM KDD International Conference on Knowledge Discovery and Data Mining; KDD 2008

Automatic Record Linkage Using Seeded Nearest Neighbour and Support Vector Machine Classification

Abstract

The task of linking databases is an important step in an increasing number of data mining projects, because linked data can contain information that is not available otherwise, or that would require time-consuming and expensive collection of specific data. The aim of linking is to match and aggregate all records that refer to the same entity. One of the major challenges when linking large databases is the efficient and accurate classification of record pairs into matches and non-matches. While classification was traditionally based on manually set thresholds or on statistical procedures, many of the more recently developed classification methods are based on supervised learning techniques. They therefore require training data, which is often not available in real-world situations or has to be prepared manually, an expensive, cumbersome and time-consuming process.

The author has previously presented a novel two-step approach to automatic record pair classification [6, 7]. In the first step of this approach, training examples of high quality are automatically selected from the compared record pairs. In the second step, these examples are used to train a support vector machine (SVM) classifier. Initial experiments showed the feasibility of the approach, achieving results that outperformed k-means clustering. In this paper, two variations of this approach are presented. The first is based on a nearest-neighbour classifier, while the second improves an SVM classifier by iteratively adding more examples to the training sets. Experimental results show that this two-step approach can achieve better classification results than other unsupervised approaches.
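The two-step idea with a nearest-neighbour second step can be illustrated with a minimal Python sketch. It is not the paper's implementation, and several details are assumptions made for illustration only: each compared record pair is represented as a vector of per-field similarities in [0, 1]; seeds are chosen by Euclidean distance to the all-ones vector (match) or the all-zeros vector (non-match); and the function names and the `margin` parameter are hypothetical. The iterative SVM variation is not shown.

```python
from math import sqrt


def euclidean(a, b):
    """Euclidean distance between two equal-length similarity vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_seeds(vectors, margin=0.3):
    """Step 1 (assumed heuristic): pick high-confidence training examples.

    Vectors within `margin` of all-ones become match seeds, vectors within
    `margin` of all-zeros become non-match seeds; the rest stay unlabelled.
    """
    dims = len(vectors[0])
    ones, zeros = (1.0,) * dims, (0.0,) * dims
    matches, non_matches, rest = [], [], []
    for v in vectors:
        if euclidean(v, ones) <= margin:
            matches.append(v)
        elif euclidean(v, zeros) <= margin:
            non_matches.append(v)
        else:
            rest.append(v)
    return matches, non_matches, rest


def seeded_nearest_neighbour(vectors, margin=0.3):
    """Step 2: grow both classes from the seeds, nearest vector first."""
    matches, non_matches, rest = select_seeds(vectors, margin)
    if not matches or not non_matches:
        raise ValueError("need at least one seed in each class")
    while rest:
        best = None  # (distance, index in rest, is_match)
        for i, v in enumerate(rest):
            d_match = min(euclidean(v, m) for m in matches)
            d_non = min(euclidean(v, n) for n in non_matches)
            d = min(d_match, d_non)
            if best is None or d < best[0]:
                best = (d, i, d_match <= d_non)
        _, i, is_match = best
        # The newly labelled vector joins its class and can attract others.
        (matches if is_match else non_matches).append(rest.pop(i))
    return matches, non_matches


# Toy similarity vectors for five compared record pairs (made-up values).
pairs = [(1.0, 1.0, 0.9), (0.0, 0.1, 0.0), (0.8, 0.7, 0.9),
         (0.2, 0.1, 0.3), (1.0, 0.9, 1.0)]
matches, non_matches = seeded_nearest_neighbour(pairs)
```

With these toy vectors, three pairs fall within the margin and act as seeds, and the two remaining pairs are then assigned to whichever class holds their nearest labelled neighbour. The loop is quadratic in the number of unlabelled vectors, so a sketch like this is only suitable for small inputs.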
