Journal: ACM Transactions on Database Systems

Automatic Complex Schema Matching Across Web Query Interfaces: A Correlation Mining Approach

Abstract

To enable information integration, schema matching is a critical step for discovering semantic correspondences of attributes across heterogeneous sources. Although complex matchings are common, most existing techniques focus on simple 1:1 matchings because the search space for complex matchings is far more complex. To tackle this challenge, this article takes a conceptually novel approach by viewing schema matching as correlation mining, for our task of matching Web query interfaces to integrate the myriad databases on the Internet. On this "deep Web," query interfaces generally form complex matchings between attribute groups (e.g., {author} corresponds to {first name, last name} in the Books domain). We observe that the co-occurrence patterns across query interfaces often reveal such complex semantic relationships: grouping attributes (e.g., {first name, last name}) tend to be co-present in query interfaces and are thus positively correlated. In contrast, synonym attributes are negatively correlated because they rarely co-occur. This insight enables us to discover complex matchings by a correlation mining approach. In particular, we develop the DCM framework, which consists of data preprocessing, dual mining of positive and negative correlations, and finally matching construction. We evaluate the DCM framework on manually extracted interfaces, and the results show good accuracy for discovering complex matchings. Further, to automate the entire matching process, we incorporate automatic techniques for interface extraction. Executing the DCM framework on automatically extracted interfaces, we find that the inevitable errors in automatic interface extraction may significantly affect the matching result. To make the DCM framework robust against such "noisy" schemas, we integrate it with a novel "ensemble" approach, which creates an ensemble of DCM matchers by randomizing the schema data into many trials and aggregating their ranked results by majority voting. As a principled basis, we provide analytic justification of the robustness of the ensemble approach. Empirically, our experiments show that the "ensemblization" indeed significantly boosts matching accuracy on automatically extracted, and thus noisy, schema data. By employing the DCM framework with the ensemble approach, we thus complete an automatic process of matching Web query interfaces.
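The co-occurrence idea in the abstract can be illustrated with a small sketch. This is not the DCM framework itself. It assumes each query interface is reduced to a set of attribute names, and it uses the phi coefficient (Pearson correlation of two binary variables) as a stand-in for the article's correlation measures, which the abstract does not specify. The function names, thresholds, and toy data are hypothetical. The ensemble step votes over unranked attribute pairs from random subsamples, a simplification of the ranked aggregation the abstract describes.

```python
import random
from collections import Counter
from itertools import combinations
from math import sqrt


def phi(interfaces, a, b):
    """Phi coefficient between the presence of attributes a and b."""
    n11 = sum(1 for s in interfaces if a in s and b in s)
    n10 = sum(1 for s in interfaces if a in s and b not in s)
    n01 = sum(1 for s in interfaces if a not in s and b in s)
    n00 = len(interfaces) - n11 - n10 - n01
    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    # An attribute present in all or none of the interfaces carries no signal.
    return 0.0 if denom == 0 else (n11 * n00 - n10 * n01) / denom


def mine_pairs(interfaces, pos_threshold=0.5, neg_threshold=-0.5):
    """Split attribute pairs into grouping (positive) and synonym (negative)."""
    attrs = sorted(set().union(*interfaces))
    grouping, synonym = [], []
    for a, b in combinations(attrs, 2):
        score = phi(interfaces, a, b)
        if score >= pos_threshold:
            grouping.append((a, b))
        elif score <= neg_threshold:
            synonym.append((a, b))
    return grouping, synonym


def ensemble_synonyms(interfaces, trials=25, sample_size=4, seed=0):
    """Keep synonym pairs found in a majority of random subsample trials."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(trials):
        sample = rng.sample(interfaces, sample_size)
        _, synonym = mine_pairs(sample)
        votes.update(synonym)
    return sorted(pair for pair, count in votes.items() if count > trials / 2)


# Toy Books-domain interfaces (invented for illustration only).
interfaces = [
    {"author", "title"},
    {"first name", "last name", "title"},
    {"author", "title"},
    {"first name", "last name", "title"},
    {"author", "title", "isbn"},
    {"first name", "last name", "title", "isbn"},
]

grouping, synonym = mine_pairs(interfaces)
voted = ensemble_synonyms(interfaces)
```

On this toy data, "first name" and "last name" always appear together and come out as a grouping pair, while "author" never co-occurs with either and comes out as a synonym candidate for both, mirroring the {author} and {first name, last name} example in the abstract.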
