Journal: Advanced Science Letters

Evaluating the Quality of Datasets in Software Engineering


Abstract

Research based on datasets needs to determine the quality of the data that form the basis of its results, and the meaning of those data needs to be interpreted correctly. A dataset essentially contains data and metadata. Data are often misinterpreted due to insufficient metadata, which gives rise to dataset quality issues such as failure to clearly identify the entity being measured and inability to clarify how the measurements were generated. The goal of this study was to determine a useful way to evaluate the quality of datasets. We developed a quality assessment process with four steps: assessing the dataset model, identifying the data quality issues, evaluating the metadata, and preparing the assessment report. We introduced formal definitions of data quality issues to identify those issues, as well as an evaluation scale to rate the quality of the metadata in datasets. We applied the quality assessment process to 92 existing datasets from real data repositories and found a number of common data quality issues in these datasets, such as duplicate, incorrect, and missing data. We also found 70 datasets containing insufficient metadata for entity and metrics. Our quality assessment process can be used to identify datasets that might carry risks of data misinterpretation because of absent metadata. The process also allows researchers to draw conclusions about whether a dataset has sufficient metadata to support correct interpretation for analysis in empirical research.
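To make two of the issue types named in the abstract concrete, here is a minimal Python sketch that counts duplicate records and missing values and flags absent metadata fields. It is an illustration only, not the authors' formal definitions or evaluation scale, which the abstract does not reproduce. The function names, the set of missing-value markers, and the required metadata keys (`entity`, `metrics`) are all assumptions introduced for this example.

```python
from collections import Counter

# Assumed markers for a missing value; real repositories may use others.
MISSING_MARKERS = {"", "?", "NA", "null", None}


def find_quality_issues(rows):
    """Count exact duplicate records and missing values per column.

    `rows` is a list of dicts that map column names to hashable values.
    """
    # Two records are treated as duplicates when every column matches.
    record_counts = Counter(tuple(sorted(row.items())) for row in rows)
    duplicates = sum(count - 1 for count in record_counts.values())

    missing = Counter()
    for row in rows:
        for column, value in row.items():
            if value in MISSING_MARKERS:
                missing[column] += 1

    return {"duplicates": duplicates, "missing": dict(missing)}


def metadata_gaps(metadata, required=("entity", "metrics")):
    """Return the required metadata keys that are absent or empty (hypothetical check)."""
    return [key for key in required if not metadata.get(key)]


if __name__ == "__main__":
    sample = [
        {"module": "a.c", "loc": "120"},
        {"module": "a.c", "loc": "120"},  # exact duplicate of the first record
        {"module": "b.c", "loc": "?"},    # missing measurement
    ]
    print(find_quality_issues(sample))
    print(metadata_gaps({"entity": "source file"}))
```

Detecting incorrect data, the third issue type the abstract lists, would need domain rules for each metric (for example, valid ranges), so it is left out of this sketch.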
