Conference: International Conference on Model and Data Engineering

Keeping the Data Lake in Form: DS-kNN Datasets Categorization Using Proximity Mining

Abstract

As the number of datasets stored in data repositories grows, there has been a trend toward using Data Lakes (DLs) to store such data. DLs store datasets in their raw formats without any transformation or preprocessing, and access is provided through schema-on-read. This makes it difficult for analysts to find datasets that can be crossed and that belong to the same topic. To support them in this DL governance challenge, we propose an algorithm for categorizing the datasets in a DL into pre-defined, topic-wise categories of interest. We utilise a k-NN approach that uses a proximity score to compute the similarity of datasets based on their metadata. We test our algorithm on a real-life DL with a known ground-truth categorization. Our approach is successful in detecting the correct categories for datasets and outliers, with a precision of more than 90% and recall rates exceeding 75% in specific settings.
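
To make the idea of k-NN categorization over a metadata-based proximity score more concrete, here is a minimal sketch in Python. It is not the DS-kNN algorithm from the paper: the abstract does not define the proximity score, so the `proximity` function below is a stand-in (Jaccard overlap of attribute names), and `categorize`, `k`, `min_proximity`, and the toy datasets are all hypothetical names and values chosen for illustration. A dataset whose nearest neighbours all fall below the threshold is treated as an outlier.

```python
from collections import Counter


def proximity(meta_a, meta_b):
    """Assumed stand-in proximity score: Jaccard overlap of attribute-name sets.

    The paper's actual proximity score is not given in the abstract.
    """
    a, b = set(meta_a), set(meta_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def categorize(new_meta, labelled, k=3, min_proximity=0.3):
    """Return the majority category among the k nearest labelled datasets.

    labelled is a list of (metadata, category) pairs. Neighbours scoring
    below min_proximity (an assumed cut-off) are ignored; if none remain,
    the dataset is reported as an outlier by returning None.
    """
    scored = sorted(
        ((proximity(new_meta, meta), category) for meta, category in labelled),
        key=lambda pair: pair[0],
        reverse=True,
    )
    neighbours = [(score, cat) for score, cat in scored[:k] if score >= min_proximity]
    if not neighbours:
        return None  # no sufficiently close dataset: treat as an outlier
    votes = Counter(cat for _, cat in neighbours)
    return votes.most_common(1)[0][0]


# Toy, made-up datasets described only by their attribute names.
labelled = [
    ({"patient_id", "age", "diagnosis"}, "health"),
    ({"patient_id", "age", "blood_pressure"}, "health"),
    ({"ticker", "price", "volume"}, "finance"),
    ({"ticker", "open", "close"}, "finance"),
]

result_health = categorize({"patient_id", "age", "weight"}, labelled)
result_outlier = categorize({"latitude", "longitude"}, labelled)
```

In this toy setting the first query shares attributes with the two health datasets and is assigned their category, while the second shares nothing with any labelled dataset and is flagged as an outlier. Raising `min_proximity` makes the sketch stricter about assigning a category, which is one plausible way a trade-off between precision and recall could arise.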
