FDTool: a Python application to mine for functional dependencies and candidate keys in tabular data

Abstract

Functional dependencies (FDs) and candidate keys are essential for table decomposition, database normalization, and data cleansing. In this paper, we present FDTool, a command-line Python application that discovers minimal FDs in tabular datasets and infers equivalent attribute sets and candidate keys from them. The runtime and memory costs associated with seven published FD discovery algorithms are given with an overview of their theoretical foundations. Previous research establishes that FD_Mine is the most efficient FD discovery algorithm when applied to datasets with many rows (> 100,000 rows) and few columns (< 14 columns). This makes it particularly well suited to rule mining in clinical and demographic datasets, which often consist of long and narrow sets of participant records. The structure of FD_Mine is described and supplemented with a formal proof of the equivalence pruning method used. FDTool is a re-implementation of FD_Mine with additional features added to improve performance and automate typical processes in database architecture. The experimental results of applying FDTool to 13 datasets of different dimensions are summarized in terms of the number of FDs checked, the number of FDs found, and the time it takes for the code to terminate. We find that the number of attributes in a dataset has a much greater effect on the runtime and memory costs of FDTool than does row count. The last section explains in detail how the FDTool application can be accessed, executed, and further developed.
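
To make the two central notions concrete, the following is a minimal sketch of what it means for a functional dependency to hold and for an attribute set to be a candidate key. It is a brute-force illustration only: it is not FD_Mine, it is not FDTool's code, and it uses none of the pruning the abstract describes. The function names `holds` and `candidate_keys` and the sample rows are hypothetical, and the key search assumes the table contains no duplicate rows.

```python
from itertools import combinations


def holds(rows, lhs, rhs):
    """Return True if the FD lhs -> rhs holds in rows (a list of dicts).

    The FD holds when every distinct combination of lhs values maps to
    exactly one rhs value.
    """
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        # setdefault stores the first rhs value seen for this lhs value
        # and returns it; any later mismatch violates the dependency.
        if seen.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True


def candidate_keys(rows, attributes):
    """Brute-force search for minimal attribute sets that identify each row.

    Assumes rows contains no exact duplicates (illustrative only).
    """
    keys = []
    for size in range(1, len(attributes) + 1):
        for combo in combinations(attributes, size):
            # Skip supersets of keys already found: they are not minimal.
            if any(set(k) <= set(combo) for k in keys):
                continue
            projected = {tuple(row[a] for a in combo) for row in rows}
            if len(projected) == len(rows):
                keys.append(combo)
    return keys


# Hypothetical sample data, not taken from the paper's datasets.
rows = [
    {"id": 1, "zip": "10001", "city": "New York"},
    {"id": 2, "zip": "10001", "city": "New York"},
    {"id": 3, "zip": "60601", "city": "Chicago"},
]

print(holds(rows, ["zip"], "city"))   # zip -> city
print(holds(rows, ["city"], "id"))    # city -> id
print(candidate_keys(rows, ["id", "zip", "city"]))
```

In this sample, `zip -> city` holds because each zip code appears with a single city, while `city -> id` does not because "New York" appears with two different ids. The exhaustive search over attribute combinations grows exponentially with the number of columns, which is consistent with the abstract's observation that attribute count affects cost far more than row count.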