DS-Prox: Dataset Proximity Mining for Governing the Data Lake

Abstract

With the arrival of Data Lakes (DL), there is an increasing need for efficient dataset classification to support data analysis and information retrieval. Our goal is to use meta-features describing datasets to detect whether they are similar. We utilise a novel proximity mining approach to assess the similarity of datasets. The proximity scores serve as an efficient first step: only pairs of datasets with high proximity are selected for the subsequent, time-consuming schema matching and deduplication. The proposed approach prunes unnecessary computations early, thus improving the efficiency of similar-schema search. We evaluate our approach in experiments using the OpenML online DL, which show significant efficiency gains above 25% compared to matching without early pruning, and recall rates higher than 90% under certain scenarios.
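
The early-pruning idea can be illustrated with a minimal sketch. The abstract does not say which meta-features or scoring function the authors use, so everything below is an assumption made for illustration: the three meta-features, the averaged relative-difference score, the 0.8 threshold, and the names `meta_features`, `proximity`, and `candidate_pairs` are all hypothetical and are not the DS-Prox method itself.

```python
from itertools import combinations


def meta_features(columns):
    """Summarise a dataset given as {attribute_name: list_of_values}.

    Illustrative meta-features only (assumed, not taken from the paper).
    """
    n_attrs = len(columns)
    n_rows = max((len(values) for values in columns.values()), default=0)
    n_numeric = sum(
        1
        for values in columns.values()
        if values
        and all(
            isinstance(x, (int, float)) and not isinstance(x, bool)
            for x in values
        )
    )
    return {
        "n_attrs": float(n_attrs),
        "n_rows": float(n_rows),
        "numeric_ratio": n_numeric / n_attrs if n_attrs else 0.0,
    }


def proximity(a, b):
    """Mean per-feature similarity in [0, 1] for non-negative meta-features."""
    sims = []
    for key in a:
        largest = max(a[key], b[key])
        # Two zero values are treated as identical.
        sims.append(1.0 if largest == 0 else 1.0 - abs(a[key] - b[key]) / largest)
    return sum(sims) / len(sims)


def candidate_pairs(datasets, threshold=0.8):
    """Keep only dataset pairs whose proximity reaches the threshold.

    Pairs below the threshold are pruned and never reach the expensive
    schema-matching step.
    """
    feats = {name: meta_features(cols) for name, cols in datasets.items()}
    kept = []
    for x, y in combinations(sorted(feats), 2):
        score = proximity(feats[x], feats[y])
        if score >= threshold:
            kept.append((x, y, score))
    return kept


# Toy data lake: two structurally similar tables and one unrelated log.
datasets = {
    "iris_a": {
        "sepal": [5.1, 4.9, 4.7],
        "petal": [1.4, 1.4, 1.3],
        "species": ["setosa", "setosa", "setosa"],
    },
    "iris_b": {
        "sepal_len": [6.0, 5.5, 5.8],
        "petal_len": [4.0, 3.9, 4.1],
        "class": ["versicolor", "versicolor", "versicolor"],
    },
    "log": {"message": list("abcdefghijkl")},
}

if __name__ == "__main__":
    for left, right, score in candidate_pairs(datasets):
        print(left, right, round(score, 3))
```

In this toy setting only the `iris_a`/`iris_b` pair survives the filter, so a downstream schema matcher would be run on one pair instead of three. Raising the threshold prunes more pairs at the risk of lower recall, which is the trade-off the abstract's efficiency and recall figures refer to.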

Bibliographic record

Source
International Conference on Similarity Search and Applications | 2017 | xi, 332 p. | 16 pages
Authors
Ayman Alserafi; Toon Calders; Alberto Abello; Oscar Romero
Original format: PDF
Chinese Library Classification: Theory, methods
Similar literature

1. Keeping the Data Lake in Form: Proximity Mining for Pre-Filtering Schema Matching [J]. Ayman Alserafi, Alberto Abello, Oscar Romero, ACM Transactions on Information Systems. 2020, No. 3
2. Trends Associated with Hemorrhoids in Japan: Data Mining of Medical Information Datasets and the National Database of Health Insurance Claims and Specific Health Checkups of Japan (NDB) Open Data Japan [J]. Mukai Ririka, Shimada Kazuyo, Suzuki Takaaki, Biological & Pharmaceutical Bulletin. 2020, No. 12
3. A Semi-automated Approach to Create Purposeful Mechanistic Datasets from Heterogeneous Data: Data Mining Towards the in silico Predictions for Oestrogen Receptor Modulation and Teratogenicity [J]. Bashir Surfraz M., Fowkes Adrian, Plante Jeffrey P., Molecular Informatics. 2017, No. 8
4. DS-Prox: Dataset Proximity Mining for Governing the Data Lake [C]. Ayman Alserafi, Toon Calders, Alberto Abello, International Conference on Similarity Search and Applications. 2017
5. Mining massive moving object datasets from RFID flow analysis to traffic mining [D]. Gonzalez, Hector. 2008
6. HRT Atlas v1.0 database: redefining human and mouse housekeeping genes and candidate reference transcripts by mining massive RNA-seq datasets [O]. Bidossessi Wilfried Hounkpe, Francine Chenou, Franciele de Lima. 2021
7. Mining Massive Moving Object Datasets from RFID Data Flow Analysis to Traffic Mining [O]. Gonzalez, Hector. 2008
8. How Does Mediterranean Basin's Atmosphere Become Weak Moisture Source During Negative Phase of NAO: Use of AIRS, AMSR, TOVS, & TRMM Satellite Datasets Over Last Two NAO Cycles to Examine Governing Controls on E-P [R]. Smith, E. A., Mehta, A. V. 2008
