Making holistic schema matching robust

Abstract

The Web has been rapidly "deepened" by myriad searchable databases online, where data are hidden behind query interfaces. As an essential task toward integrating these massive "deep Web" sources, large-scale schema matching (i.e., discovering semantic correspondences of attributes across many query interfaces) has been actively studied recently. In particular, many works have emerged that address this problem by "holistically" matching many schemas at the same time, and are thus "mining" approaches in nature. However, while holistic schema matching builds its promise on the large quantity of input schemas, it also suffers from a robustness problem caused by noisy input data. Such noise almost inevitably arises in the automatic extraction of schema data, which is mandatory in large-scale integration. For holistic matching to be viable, it is thus essential to make it robust against noisy schemas. To tackle this challenge, we propose a data-ensemble framework with sampling and voting techniques, inspired by bagging predictors. Specifically, our approach creates an ensemble of matchers by randomizing the input schema data into many independently downsampled trials, executing the same matcher on each trial, and then aggregating their ranked results by majority voting. As a principled basis, we provide an analytic justification of the effectiveness of this data-ensemble framework. Further, our experiments on real Web data show empirically that the "ensemblization" indeed significantly boosts matching accuracy under noisy schema input, and thus maintains the desired robustness of a holistic matcher.
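
The sampling-and-voting idea described in the abstract can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: `ensemble_match`, `toy_matcher`, the parameters `num_trials` and `sample_size`, and the sample data do not come from the paper. The sketch also simplifies the aggregation step: it votes on unordered sets of matches, whereas the abstract describes aggregating ranked results. The base matcher is a toy stand-in (attributes that are frequent but never co-occur are treated as synonyms), not the holistic matcher the authors evaluate.

```python
import random
from collections import Counter
from itertools import combinations
from typing import Callable, FrozenSet, Optional, Sequence, Set, Tuple

Schema = FrozenSet[str]    # one query interface, as a set of attribute names
Match = Tuple[str, str]    # a proposed correspondence between two attributes


def ensemble_match(
    schemas: Sequence[Schema],
    base_matcher: Callable[[Sequence[Schema]], Set[Match]],
    num_trials: int = 25,
    sample_size: Optional[int] = None,
    seed: int = 0,
) -> Set[Match]:
    """Run the same matcher on random downsamples and keep majority matches."""
    rng = random.Random(seed)
    if sample_size is None:
        # Assumed default: each trial sees half of the input schemas.
        sample_size = max(1, len(schemas) // 2)
    votes: Counter = Counter()
    for _ in range(num_trials):
        # Downsample without replacement, independently for every trial.
        trial = rng.sample(list(schemas), sample_size)
        # Each trial casts at most one vote per match.
        votes.update(set(base_matcher(trial)))
    # Keep only matches found in a strict majority of the trials.
    return {m for m, count in votes.items() if count > num_trials / 2}


def toy_matcher(schemas: Sequence[Schema], min_support: int = 2) -> Set[Match]:
    """Hypothetical base matcher: frequent attributes that never co-occur."""
    counts = Counter(attr for schema in schemas for attr in schema)
    frequent = sorted(a for a, c in counts.items() if c >= min_support)
    matches: Set[Match] = set()
    for a, b in combinations(frequent, 2):
        if not any(a in s and b in s for s in schemas):
            matches.add((a, b))
    return matches


if __name__ == "__main__":
    # Made-up input: two naming conventions plus a few noisy extractions.
    clean = [frozenset({"title", "author"})] * 10
    clean += [frozenset({"title", "writer"})] * 10
    noisy = [frozenset({"title", "author", "writer"})] * 2
    result = ensemble_match(clean + noisy, toy_matcher, num_trials=51, sample_size=8)
    print(sorted(result))
```

In this toy setting a single noisy schema that lists both `author` and `writer` is enough to make `toy_matcher` drop the pair when it runs on the full input. A small downsample often misses the noisy schemas entirely, so many trials can still propose the pair, and the vote decides whether it is kept. How often that happens depends on the assumed `sample_size` and `num_trials`.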