
Finding nontrivial semantic matches between database schemas

Abstract

Automated schema matching has been under investigation for decades, yet matching systems still usually fail to find all matches or suggest incorrect ones. Because of this imperfection, schema matching is still often done manually by domain experts. With the rapidly increasing number of heterogeneous and distributed data sources in enterprises and on the web, the manual approach is increasingly a limitation, and the need to automate the schema matching process is growing.

This thesis describes the schema matching framework and prototype Map-IT, which is based on FlexiMatch. The schema matcher supports the multi-strategy approach, with each strategy represented as a Validator. Key characteristics of Map-IT are:

• Map-IT and its Validators can learn from previous mappings.
• Validators can easily be added to or selected from the Validator repository, in order to boost future matching performance or to adapt the system to the match task at hand.
• Current Validators exploit different aspects of database information.
• Map-IT adapts the weights of the Validators to its environment using the Meta-Learner.

An important limitation of Map-IT was that it did not search for nontrivial matches, nor was it able to suggest matches with a complex local cardinality. One of the goals was to list and analyze what kinds of nontrivial matches exist. In the thesis, various match problems are addressed with multiple examples and categorized according to similarities in the correlation between the attribute semantics. The freedom and variety of database modeling further complicate the schema matching problem.

The substring match category represents matches in which the instance data of the matching attributes contain duplicate substrings that can be separated by delimiting characters. These matches can be spread out over multiple attributes and may have a partial semantic overlap. No existing approach was found that solves this type of match problem. A new approach is developed that searches for likely linked record pairs, coping with unaligned schemas and bad duplicates such as ambiguous words and stop words. For each record pair, accompanying transformation functions are generated, which contain string split and concatenate operations. From the set of transformation functions, likely substring matches are mined using a clustering technique, and a similarity value is calculated for each match. A set of transformation functions is assigned to each match. If a specific match has alternative transformation functions, a ranked list is given.

Evaluation of the substring validator showed that, in various experiments with real-world scenarios, it contributes substantially to better performance of the schema matching prototype. The new validator copes with a fair amount of dirty data and was not very sensitive to the presence of incorrectly linked record pairs. The feature that excludes unsuitable attributes is able to restrict the number of incorrect substring matches. Despite the positive evaluation results, various recommendations are made that have the potential to improve the performance of the solution even further. Overall, it can be concluded that the chosen innovative approach, which uses transformation functions not only to explain the semantic correlation in the match result but also to find matches, is promising and might also be used to solve other problem categories, e.g. the arithmetic relationships category. The transformation functions may also be used during data integration.

Besides the new validator, Map-IT is extended with the possibility to suggest complex matches and to learn from feedback on them. Now that the framework can handle complex matches, new validators that produce complex match suggestions can be plugged in quite easily. Other match categories pointed out in the thesis can be used as a "stepping stone" for future projects. The division of the nontrivial match problems into various sub-problems implies the necessity of a multi-strategy approach, which the Map-IT framework now also supports for complex matches.
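
To make the idea of split-and-concatenate transformation functions concrete, here is a minimal Python sketch. It is not the thesis's implementation: the delimiter set, the function names (`split_tokens`, `derive_transformation`, `apply_transformation`, `rank_transformations`) and the sample records are all assumptions made for illustration, and plain frequency counting stands in for the clustering technique the abstract mentions. The sketch assumes the record pairs have already been linked.

```python
import re
from collections import Counter

# Assumed delimiter set; the abstract only says "delimiting characters".
DELIMITERS = r"[ ,;/\-]+"


def split_tokens(value):
    """Split an attribute value into substrings on the assumed delimiters."""
    return [t for t in re.split(DELIMITERS, value) if t]


def derive_transformation(source_record, target_value):
    """Explain each token of the target value as (source attribute, token index).

    Returns a tuple of such steps, or None when some target token does not
    occur in any source attribute (no split/concatenate explanation exists).
    """
    steps = []
    for token in split_tokens(target_value):
        origin = None
        for attr, value in source_record.items():
            tokens = split_tokens(value)
            if token in tokens:
                origin = (attr, tokens.index(token))
                break
        if origin is None:
            return None
        steps.append(origin)
    return tuple(steps)


def apply_transformation(steps, source_record, joiner=" "):
    """Concatenate the selected source tokens into a target-style value."""
    return joiner.join(split_tokens(source_record[attr])[i] for attr, i in steps)


def rank_transformations(linked_pairs, target_attr):
    """Rank candidate transformations by the share of linked pairs they explain.

    Hypothetical stand-in for the clustering and similarity steps.
    """
    counts = Counter()
    for source_record, target_record in linked_pairs:
        steps = derive_transformation(source_record, target_record[target_attr])
        if steps is not None:
            counts[steps] += 1
    total = len(linked_pairs)
    return [(steps, n / total) for steps, n in counts.most_common()]


# Hypothetical linked record pairs from two schemas; the third pair is "dirty".
linked_pairs = [
    ({"first_name": "Ada", "last_name": "Lovelace"}, {"full_name": "Lovelace, Ada"}),
    ({"first_name": "Alan", "last_name": "Turing"}, {"full_name": "Turing, Alan"}),
    ({"first_name": "Grace", "last_name": "Hopper"}, {"full_name": "G. Hopper"}),
]

ranked = rank_transformations(linked_pairs, "full_name")
best_steps, support = ranked[0]
rebuilt = apply_transformation(
    best_steps, {"first_name": "Edsger", "last_name": "Dijkstra"}, joiner=", "
)
```

In this toy data, two of the three pairs agree on the same transformation (last name token followed by first name token), so it ranks first even though the third pair cannot be explained. This loosely mirrors the tolerance for dirty data described above, but only under the simplifications stated here.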

Bibliographic record

  • Author

    Visser J.T.

  • Year 2007
  • Original format PDF
  • Language English
