Venue: International Conference on Similarity Search and Applications

DS-Prox: Dataset Proximity Mining for Governing the Data Lake

Abstract

With the arrival of Data Lakes (DL), there is an increasing need for efficient dataset classification to support data analysis and information retrieval. Our goal is to use meta-features describing datasets to detect whether they are similar. We utilise a novel proximity mining approach to assess the similarity of datasets. The proximity scores are used as an efficient first step, where pairs of datasets with high proximity are selected for further time-consuming schema matching and deduplication. The proposed approach helps prune unnecessary computations early, thus improving the efficiency of similar-schema search. We evaluate our approach in experiments using the OpenML online DL, which show significant efficiency gains above 25% compared to matching without early pruning, and recall rates above 90% under certain scenarios.
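
The prune-then-match idea in the abstract can be illustrated with a minimal sketch. The abstract does not specify the meta-features or how proximity is computed, so everything below is an assumption made for illustration: the three meta-features, the averaging formula in `proximity`, the `threshold` value, and the function names are hypothetical and are not the authors' method.

```python
from itertools import combinations


def meta_features(columns):
    """Summarise a dataset given as {attribute_name: list_of_values}.

    The three features here are illustrative placeholders only.
    """
    n_attrs = len(columns)
    n_numeric = sum(
        all(isinstance(v, (int, float)) for v in values)
        for values in columns.values()
    )
    n_rows = max((len(values) for values in columns.values()), default=0)
    return {
        "n_attrs": float(n_attrs),
        "numeric_ratio": n_numeric / n_attrs if n_attrs else 0.0,
        "n_rows": float(n_rows),
    }


def proximity(a, b):
    """Mean per-feature similarity in [0, 1] (assumed formula).

    1.0 means the two meta-feature vectors are identical.
    """
    sims = []
    for key in a:
        hi = max(abs(a[key]), abs(b[key]))
        sims.append(1.0 if hi == 0 else 1.0 - abs(a[key] - b[key]) / hi)
    return sum(sims) / len(sims)


def candidate_pairs(datasets, threshold=0.8):
    """Keep only the dataset pairs whose proximity reaches the threshold.

    Only these pairs would be passed on to the expensive schema-matching step.
    """
    feats = {name: meta_features(cols) for name, cols in datasets.items()}
    return [
        (x, y)
        for x, y in combinations(sorted(feats), 2)
        if proximity(feats[x], feats[y]) >= threshold
    ]


# Toy data: two similarly shaped tables and one unrelated single-column table.
datasets = {
    "iris_a": {
        "sepal": [5.1, 4.9, 4.7],
        "petal": [1.4, 1.4, 1.3],
        "species": ["s", "s", "s"],
    },
    "iris_b": {
        "sepal_len": [6.0, 5.5, 5.8],
        "petal_len": [4.0, 3.9, 4.1],
        "class": ["v", "v", "v"],
    },
    "logs": {"message": list("abcdefghijkl")},
}

print(candidate_pairs(datasets))
```

With this toy input, only the two similarly shaped tables survive the filter, so one pair out of three would go on to schema matching. Raising `threshold` prunes more pairs and saves more work, at the risk of discarding pairs that are in fact similar, which is the efficiency versus recall trade-off the abstract reports.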
