Automatic Record Linkage Using Seeded Nearest Neighbour and Support Vector Machine Classification

Abstract

The task of linking databases is an important step in an increasing number of data mining projects, because linked data can contain information that is not available otherwise, or that would require time-consuming and expensive collection of specific data. The aim of linking is to match and aggregate all records that refer to the same entity. One of the major challenges when linking large databases is the efficient and accurate classification of record pairs into matches and non-matches. While traditionally classification was based on manually set thresholds or on statistical procedures, many of the more recently developed classification methods are based on supervised learning techniques. They therefore require training data, which is often not available in real-world situations or has to be prepared manually, an expensive, cumbersome and time-consuming process.

The author has previously presented a novel two-step approach to automatic record pair classification [6, 7]. In the first step of this approach, training examples of high quality are automatically selected from the compared record pairs, and used in the second step to train a support vector machine (SVM) classifier. Initial experiments showed the feasibility of the approach, achieving results that outperformed k-means clustering. In this paper, two variations of this approach are presented. The first is based on a nearest-neighbour classifier, while the second improves an SVM classifier by iteratively adding more examples into the training sets. Experimental results show that this two-step approach can achieve better classification results than other unsupervised approaches.
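The two-step idea described in the abstract can be illustrated with a minimal sketch of the nearest-neighbour variation. This is not the paper's implementation: the function names (`select_seeds`, `seeded_nn_classify`), the `margin` threshold, the seed selection rule and the use of Euclidean distance are all assumptions made for illustration. Each record pair is assumed to be represented by a vector of similarity values in [0, 1], one per compared field.

```python
import math


def select_seeds(vectors, margin=0.3):
    """Step 1 (assumed rule): vectors with every similarity near 1 become
    match seeds, vectors with every similarity near 0 become non-match seeds."""
    match_seeds, non_match_seeds, unlabeled = [], [], []
    for v in vectors:
        if all(x >= 1.0 - margin for x in v):
            match_seeds.append(v)
        elif all(x <= margin for x in v):
            non_match_seeds.append(v)
        else:
            unlabeled.append(v)
    return match_seeds, non_match_seeds, unlabeled


def nearest_distance(v, seeds):
    """Euclidean distance from v to its closest vector in seeds."""
    return min(math.dist(v, s) for s in seeds)


def seeded_nn_classify(vectors, margin=0.3):
    """Step 2 (assumed rule): repeatedly take the unlabeled vector that lies
    closest to either seed set, assign it to the nearer set, and let it act
    as a seed for the remaining vectors."""
    matches, non_matches, remaining = select_seeds(vectors, margin)
    if not matches or not non_matches:
        raise ValueError("need at least one seed in each class")
    while remaining:
        best = min(
            remaining,
            key=lambda v: min(nearest_distance(v, matches),
                              nearest_distance(v, non_matches)),
        )
        remaining.remove(best)
        if nearest_distance(best, matches) <= nearest_distance(best, non_matches):
            matches.append(best)
        else:
            non_matches.append(best)
    return matches, non_matches


# Hypothetical similarity vectors for five compared record pairs.
pairs = [
    (1.0, 0.9, 1.0),
    (0.0, 0.1, 0.0),
    (0.8, 0.6, 0.9),
    (0.2, 0.4, 0.1),
    (0.9, 1.0, 0.8),
]
matches, non_matches = seeded_nn_classify(pairs)
print("matches:", matches)
print("non-matches:", non_matches)
```

No manually labelled training data is needed in this sketch: the seeds come from the compared pairs themselves, which is the property the abstract emphasises. The SVM variation would replace the nearest-neighbour step with a classifier trained on the seeds and retrained as examples are added.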


Bibliographic details

Source
ACM KDD International Conference on Knowledge Discovery and Data Mining; KDD 2008 | 2008 | pp. 133-141 | 9 pages
Author
Peter Christen
Original format: PDF
Chinese Library Classification: Information and Knowledge Dissemination
Keywords
data matching; data linkage; deduplication; entity resolution; nearest neighbour; support vector machine


Similar literature


1. Analysis and identification of kidney stone using Kth nearest neighbour (KNN) and support vector machine (SVM) classification techniques [J]. Jyoti Verma, Madhwendra Nath, Priyanshu Tripathi. Pattern recognition and image analysis: advances in mathematical theory and applications in the USSR. 2017, No. 3
2. Analysis and Identification of Kidney Stone Using Kth Nearest Neighbour (KNN) and Support Vector Machine (SVM) Classification Techniques [J]. Jyoti Verma, Madhwendra Nath, Priyanshu Tripathi. Pattern recognition and image analysis: advances in mathematical theory and applications in the USSR. 2017, No. 3
3. Assessing the performance of a modified S-transform with probabilistic neural network, support vector machine and nearest neighbour classifiers for single and multiple power quality disturbances identification [J]. Shamachurn Heman. Neural computing & applications. 2019, No. 4
4. Automatic record linkage using seeded nearest neighbour and support vector machine classification [C]. Peter Christen. ACM SIGKDD international conference on Knowledge discovery and data mining. 2008
5. Comparative classification of prostate cancer data using the Support Vector Machine, Random Forest, DualKS and k-Nearest Neighbours [D]. Sakouvogui, Kekoura. 2015
6. Comparison of Random Forest, k-Nearest Neighbor and Support Vector Machine Classifiers for Land Cover Classification Using Sentinel-2 Imagery [O]. Phan Thanh Noi, Martin Kappas. 2018
7. Automatic Record Linkage using Seeded Nearest Neighbour and Support Vector Machine Classification [O]. Peter Christen. 2008
