
Instance-Based 'One-to-Some' Assignment of Similarity Measures to Attributes (Short Paper)

Abstract

Data quality is a key factor for economic success. It is usually defined as a set of properties of data, such as completeness, accessibility, relevance, and conciseness. The latter includes the absence of multiple representations of the same real-world object. To avoid such duplicates, there is a wide range of commercial products and custom, self-coded software. These programs can be quite expensive both to acquire and to maintain. In particular, small and medium-sized companies cannot afford these tools. Moreover, it is difficult to set up and tune all necessary parameters in these programs. Recently, web-based applications for duplicate detection have emerged. However, they are not easy to integrate into the local IT landscape and require much manual configuration effort. With DAQS (Data Quality as a Service) we present a novel approach to support duplicate detection. The approach features (1) minimal required user interaction and (2) self-configuration for the provided input data. To this end, each data cleansing task is classified to find out which metadata is available. Next, similarity measures are automatically assigned to the attributes of the provided records and a duplicate detection process is carried out. In this paper we introduce a novel matching approach, called one-to-some or 1:k assignment, to assign similarity measures to attributes. We performed an extensive evaluation on a large training corpus and ten test datasets of address data and achieved promising results.
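The abstract does not spell out how the 1:k assignment is computed. The following is only an illustrative sketch under stated assumptions, not the authors' algorithm: it assumes that "one-to-some" means each attribute receives exactly one similarity measure while a single measure may be reused for at most k attributes, and that a suitability score per (measure, attribute) pair is already available. The function name `assign_one_to_some`, the measure names, and all scores are hypothetical.

```python
from typing import Dict, Tuple


def assign_one_to_some(
    scores: Dict[Tuple[str, str], float],
    k: int,
) -> Dict[str, str]:
    """Greedy sketch: map each attribute to one measure, where a
    measure may be assigned to at most k attributes (assumption)."""
    usage: Dict[str, int] = {}       # how many attributes each measure serves
    assignment: Dict[str, str] = {}  # attribute -> chosen measure

    # Visit candidate (measure, attribute) pairs from best to worst score.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    for (measure, attribute), _score in ranked:
        if attribute in assignment:
            continue  # attribute already has its one measure
        if usage.get(measure, 0) >= k:
            continue  # measure has reached its limit of k attributes
        assignment[attribute] = measure
        usage[measure] = usage.get(measure, 0) + 1
    return assignment


# Hypothetical suitability scores for address-like attributes.
scores = {
    ("name_measure", "first_name"): 0.90,
    ("name_measure", "last_name"): 0.80,
    ("name_measure", "city"): 0.60,
    ("zip_measure", "zip"): 0.95,
    ("zip_measure", "city"): 0.10,
    ("generic_edit_distance", "city"): 0.50,
}

result = assign_one_to_some(scores, k=2)
```

With k = 2 in this toy input, the hypothetical `name_measure` is shared by `first_name` and `last_name`, so `city` falls back to its next-best candidate. A greedy pass is the simplest choice for a sketch; it does not guarantee a globally optimal assignment.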