Keeping the Data Lake in Form: DS-kNN Datasets Categorization Using Proximity Mining

Abstract

With the growth of the number of datasets stored in data repositories, there has been a trend of using Data Lakes (DLs) to store such data. DLs store datasets in their raw formats without any transformations or preprocessing, with accessibility available using schema-on-read. This makes it difficult for analysts to find datasets that can be crossed and that belong to the same topic. To support them in this DL governance challenge, we propose in this paper an algorithm for categorizing datasets in the DL into pre-defined topic-wise categories of interest. We utilise a k-NN approach for this task which uses a proximity score for computing similarities of datasets based on metadata. We test our algorithm on a real-life DL with a known ground-truth categorization. Our approach is successful in detecting the correct categories for datasets and outliers with a precision of more than 90% and recall rates exceeding 75% in specific settings.
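The categorization idea in the abstract can be illustrated with a minimal sketch. It is not the authors' DS-kNN implementation. The Jaccard overlap of attribute names stands in for the paper's metadata-based proximity score, and the names `proximity`, `categorize`, `k`, `min_proximity`, and the toy `labelled` data are all assumptions made for illustration. A dataset whose nearest neighbours all fall below the threshold is reported as an outlier (`None`).

```python
from collections import Counter


def proximity(a: set, b: set) -> float:
    """Stand-in proximity score (assumption): Jaccard overlap of metadata tokens."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def categorize(new_meta: set, labelled: list, k: int = 3, min_proximity: float = 0.2):
    """Majority category among the k nearest labelled datasets.

    Returns None (outlier) when no neighbour reaches min_proximity.
    """
    scored = sorted(
        ((proximity(new_meta, meta), cat) for meta, cat in labelled),
        key=lambda pair: pair[0],
        reverse=True,
    )
    # Keep only the top-k neighbours that are close enough to count as votes.
    neighbours = [(score, cat) for score, cat in scored[:k] if score >= min_proximity]
    if not neighbours:
        return None
    votes = Counter(cat for _, cat in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical labelled datasets: (attribute names, known category).
labelled = [
    ({"patient_id", "age", "diagnosis"}, "health"),
    ({"patient_id", "age", "treatment"}, "health"),
    ({"ticker", "open", "close", "volume"}, "finance"),
    ({"ticker", "close", "dividend"}, "finance"),
]

print(categorize({"patient_id", "diagnosis", "treatment"}, labelled))
print(categorize({"ticker", "volume", "close"}, labelled))
print(categorize({"latitude", "longitude"}, labelled))
```

In this sketch the threshold `min_proximity` plays the role of the outlier cut-off: raising it trades recall for precision, which is the kind of setting-dependent trade-off the abstract reports.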

Bibliographic details

Source: International Conference on Model and Data Engineering, 2019, pp. 35-49 (15 pages)
Authors: Ayman Alserafi; Alberto Abello; Oscar Romero; Toon Calders
Original format: PDF
Keywords: Data lake categorization; k-Nearest-Neighbour; Metadata management; Proximity mining

Similar literature

1. Keeping the Data Lake in Form: Proximity Mining for Pre-Filtering Schema Matching [J]. Ayman Alserafi, Alberto Abello, Oscar Romero. ACM Transactions on Information Systems. 2020, No. 3
2. Trends Associated with Hemorrhoids in Japan: Data Mining of Medical Information Datasets and the National Database of Health Insurance Claims and Specific Health Checkups of Japan (NDB) Open Data Japan [J]. Mukai Ririka, Shimada Kazuyo, Suzuki Takaaki. Biological & pharmaceutical bulletin. 2020, No. 12
3. Advances in Cheminformatics Methodologies and Infrastructure to Support the Data Mining of Large, Heterogeneous Chemical Datasets [J]. Rajarshi Guha, Kevin Gilbert, Geoffrey Fox. Current computer-aided drug design. 2010, No. 1
4. Keeping the Data Lake in Form: DS-kNN Datasets Categorization Using Proximity Mining [C]. Ayman Alserafi, Alberto Abello, Oscar Romero. International conference on model and data engineering. 2019
5. Mining massive moving object datasets from RFID flow analysis to traffic mining [D]. Gonzalez, Hector. 2008
6. Using DICOM Metadata for Radiological Image Series Categorization: a Feasibility Study on Large Clinical Brain MRI Datasets [O]. Romane Gauriau, Christopher Bridge, Lina Chen. 2020
7. High performance frequent subgraph mining on transaction datasets: A survey and performance comparison [O]. Bismita S. Jena, Cynthia Khan, Rajshekhar Sunderraman. 2019
8. Development Work for Improved Heavy-Duty Vehicle Modeling Capability Data Mining, FHWA Datasets [R]. Lindhjem, C. E., Shepard, S. 2007
