The automation of schema matching has been under investigation for decades, yet matching systems usually fail to find all matches or suggest incorrect ones. Because of this imperfection, schema matching is still often done manually by domain experts. With the rapidly increasing number of heterogeneous and distributed data sources in enterprises and on the web, manual matching is more and more a limitation, and the need to automate the schema matching process is increasingly important.

This thesis describes the schema matching framework and prototype Map-IT, which is based on FlexiMatch. The schema matcher supports a multi-strategy approach, with each strategy represented as a Validator. Key characteristics of Map-IT are:
• Map-IT and its Validators can learn from previous mappings.
• Validators can easily be added to or selected from the Validator repository, in order to boost future matching performance or to adapt the system to the match task at hand.
• The current Validators exploit different aspects of the database information.
• Map-IT adapts the weights of the Validators to its environment using the Meta-Learner.

An important limitation of Map-IT was that it did not search for nontrivial matches. It was also unable to suggest matches with a complex local cardinality. One of the goals was to list and analyze the kinds of nontrivial matches that exist. In the thesis, various match problems are addressed with multiple examples and categorized according to similarities in the correlation between the attribute semantics. The freedom and variety of database modeling further complicate the schema matching problem.

The substring match category represents matches in which the instance data of the matching attributes share duplicate substrings that can be separated by delimiting characters. These matches can be spread out over several attributes and may have a partial semantic overlap. No existing approach was found that solves this type of match problem.
A new approach is developed that searches for likely linked record pairs while coping with schema misalignment and bad duplicates such as ambiguous words and stop words. For each record pair, accompanying transformation functions are generated, which contain string split and concatenate operations. From the set of transformation functions, likely substring matches are mined using a clustering technique, and a similarity value is calculated for each match. A set of transformation functions is assigned to each match. If a specific match has alternative transformation functions, a ranked list is given.

The evaluation of the substring validator showed that, in various experiments with real-world scenarios, it contributes substantially to a better performance of the schema matching prototype. The new validator copes with a considerable amount of dirty data and was not very sensitive to the presence of incorrectly linked record pairs. The feature that excludes unsuitable attributes is able to restrict the number of incorrect substring matches. Despite the positive evaluation results, various recommendations are made that have the potential to improve the performance of the solution even further. Overall, it can be concluded that the chosen innovative approach, which uses transformation functions not only for explaining the semantic correlation in the match result but also for finding matches, is promising and might also be used to solve other problem categories, e.g. the arithmetic relationships category. The transformation functions may also be used during data integration.

Besides the new validator, Map-IT is extended with the possibility to suggest complex matches and to learn from feedback on them. Now that the framework is able to handle complex matches, new validators that are able to produce complex match suggestions can be plugged in quite easily. Other match categories that are pointed out in the thesis can be used as a "stepping stone" for future projects.
The division of the nontrivial match problems into various sub-problems implies the necessity of a multi-strategy approach, which is now also supported by the Map-IT framework for complex matches.
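
To make the notion of a transformation function built from string split and concatenate operations more concrete, the following is a minimal illustrative sketch in Python. It is not the Map-IT implementation: the class `SplitConcat`, the function `support`, and the example records are all hypothetical and chosen only for illustration. The sketch shows how one such function could map a source attribute onto a target attribute, and how the fraction of linked record pairs it reproduces could serve as a simple indication of match quality. The clustering step and the actual similarity measure of the thesis are not modeled here.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class SplitConcat:
    """Hypothetical transformation function: split the source attribute
    values on a delimiter, pick some tokens, and concatenate them."""
    delimiter: str                     # delimiting character(s) used for splitting
    picks: Sequence[Tuple[int, int]]   # (source attribute index, token index)
    joiner: str = " "                  # string placed between the picked tokens

    def apply(self, source_values: Sequence[str]) -> str:
        # Split every source attribute value and trim surrounding whitespace.
        tokens = [[t.strip() for t in v.split(self.delimiter)]
                  for v in source_values]
        # Concatenate the selected tokens in the requested order.
        return self.joiner.join(tokens[a][i] for a, i in self.picks)


def support(fn: SplitConcat, pairs) -> float:
    """Fraction of linked record pairs whose target value is reproduced
    by fn (an assumed, simplistic stand-in for a similarity value)."""
    hits = 0
    for source_values, target_value in pairs:
        try:
            if fn.apply(source_values) == target_value:
                hits += 1
        except IndexError:
            # The record has fewer tokens than the function expects:
            # count it as not explained (e.g. dirty data).
            pass
    return hits / len(pairs) if pairs else 0.0


# Hypothetical linked record pairs: source "name" vs. target "full_name".
pairs = [
    (("Doe, John",), "John Doe"),
    (("Smith, Anna",), "Anna Smith"),
    (("Unknown",), "N/A"),            # dirty or incorrectly linked pair
]

swap_name = SplitConcat(delimiter=",", picks=[(0, 1), (0, 0)])
print(swap_name.apply(["Doe, John"]))   # prints "John Doe"
print(support(swap_name, pairs))        # 2 of the 3 pairs are reproduced
```

In this sketch a single incorrectly linked pair only lowers the support value instead of invalidating the function, which mirrors in a simplified way the tolerance to dirty data described above.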