An Automatic Entity Matching Method Based on Outlier Detection

Fan Fengfeng (樊峰峰); Li Zhanhuai (李战怀); Chen Qun (陈群); Liu Hailong (刘海龙)


Abstract

Entity matching, also known as record matching, is a key technique in data integration and data cleaning. Typical applications include matching commercial products across different websites and matching research paper records between the DBLP (Digital Bibliography & Library Project) and Scholar digital libraries. Data quality defects that are widespread in real data, such as erroneous values, missing values, and diverse representations, make entity matching challenging. Popular entity matching algorithms fall into three categories: rule-based, probabilistic, and learning-based. In e-commerce data, descriptions of the same product may vary greatly. For entity matching on datasets with such representation diversity, concise and effective matching rules usually do not exist, and training an accurate classification model is also difficult. To address this issue, this paper proposes an automatic entity matching approach based on outlier detection, denoted ODetec. First, ODetec measures the similarity of each record pair on the matching attributes and maps the pair to a point in feature space. Next, it estimates the outlier distance of each record pair in the feature space. Finally, it ranks the pairs by outlier distance and extracts the matching pairs that satisfy the matching constraints. In addition, ODetec uses principal component analysis (PCA) to transform multiple correlated matching features into mutually orthogonal principal components, which removes the conditional-independence assumption between attributes required by the Fellegi-Sunter model. As a result, it achieves better matching quality and broader applicability. Experiments on real datasets confirm the effectiveness of the ODetec approach.
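The three steps named in the abstract (similarity features, outlier distance, ranking under matching constraints) can be illustrated with a minimal Python sketch. This is not the paper's implementation: the abstract does not define the similarity measures, the outlier distance, or the matching constraints, so the sketch assumes token-level Jaccard similarity, a distance above the coordinate-wise median of all pairs, and a one-to-one constraint. The names `jaccard`, `outlier_scores`, `match_records`, and the `min_score` threshold are hypothetical, and the PCA decorrelation step is omitted for brevity.

```python
import math
from statistics import median


def jaccard(a, b):
    """Token-level Jaccard similarity in [0, 1] (assumed stand-in measure)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0  # two empty values carry no evidence of a match
    return len(ta & tb) / len(ta | tb)


def outlier_scores(points):
    """Distance of each point above the coordinate-wise median (assumed score)."""
    dims = range(len(points[0]))
    center = [median(p[d] for p in points) for d in dims]
    # Only deviations toward higher similarity count as outlying.
    return [
        math.sqrt(sum(max(p[d] - center[d], 0.0) ** 2 for d in dims))
        for p in points
    ]


def match_records(left, right, attributes, min_score=0.5):
    """Rank record pairs by outlier score and keep a one-to-one matching."""
    pairs = [(i, j) for i in range(len(left)) for j in range(len(right))]
    # Step 1: map every record pair to a point in similarity feature space.
    points = [[jaccard(left[i][a], right[j][a]) for a in attributes]
              for i, j in pairs]
    # Step 2: score each pair by how far it lies from the bulk of the pairs.
    scores = outlier_scores(points)
    # Step 3: visit pairs from most to least outlying, enforcing the constraint.
    ranked = sorted(zip(scores, pairs), reverse=True)
    used_left, used_right, matches = set(), set(), []
    for score, (i, j) in ranked:
        if score < min_score:
            break  # remaining pairs are too close to the bulk
        if i in used_left or j in used_right:
            continue  # one-to-one constraint (an assumption of this sketch)
        used_left.add(i)
        used_right.add(j)
        matches.append((i, j))
    return matches


# Toy product records from two hypothetical websites.
site_a = [
    {"title": "Apple iPhone 7 32GB Black", "brand": "Apple"},
    {"title": "Samsung Galaxy S8 64GB", "brand": "Samsung"},
    {"title": "Sony WH1000X Headphones", "brand": "Sony"},
    {"title": "Nokia 3310 Classic", "brand": "Nokia"},
]
site_b = [
    {"title": "Sony WH1000X Wireless Headphones", "brand": "Sony"},
    {"title": "iPhone 7 32GB Black by Apple", "brand": "Apple"},
    {"title": "Galaxy S8 64GB Smartphone", "brand": "Samsung"},
]

matches = match_records(site_a, site_b, ["title", "brand"])
print(matches)  # [(0, 1), (2, 0), (1, 2)]
```

The sketch relies on the observation that most of the candidate pairs are non-matches with near-zero similarity, so the few true matches stand out as outliers in feature space. In the toy data, 9 of the 12 pairs have zero similarity on both attributes, and the Nokia record correctly stays unmatched.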

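Bibliographic details

Source: 《计算机学报》 (Chinese Journal of Computers), 2017, No. 10, pp. 2197-2211 (15 pages)
Authors: Fan Fengfeng; Li Zhanhuai; Chen Qun; Liu Hailong
Affiliation (all four authors): School of Computer Science, Northwestern Polytechnical University, Xi'an 710072
Original format: PDF
Language: Chinese
CLC classification: Programming; software engineering
Keywords: data integration; entity matching; data quality; outlier detection; principal component analysis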

