ACM Journal of Data and Information Quality

Scalable Methods for Measuring the Connectivity and Quality of Large Numbers of Linked Datasets



Abstract

Although the ultimate objective of Linked Data is linking and integration, it is not currently evident how connected the current Linked Open Data (LOD) cloud is. In this article, we focus on methods, supported by special indexes and algorithms, for performing measurements related to the connectivity of more than two datasets that are useful in various tasks including (a) Dataset Discovery and Selection; (b) Object Coreference, i.e., for obtaining complete information about a set of entities, including provenance information; (c) Data Quality Assessment and Improvement, i.e., for assessing the connectivity between any set of datasets and monitoring their evolution over time, as well as for estimating data veracity; (d) Dataset Visualizations; and various other tasks. Since it would be prohibitively expensive to perform all these measurements in a naive way, we introduce indexes (and their construction algorithms) that can speed up such tasks. In brief, we introduce (i) a namespace-based prefix index, (ii) a sameAs catalog for computing the symmetric and transitive closure of the owl:sameAs relationships encountered in the datasets, (iii) a semantics-aware element index (that exploits the aforementioned indexes), and, finally, (iv) two lattice-based incremental algorithms for speeding up the computation of the intersection of URIs of any set of datasets. For enhancing scalability, we propose parallel index construction algorithms and parallel lattice-based incremental algorithms, we evaluate the achieved speedup using either a single machine or a cluster of machines, and we provide insights regarding the factors that affect efficiency. Finally, we report measurements about the connectivity of the (billion triples-sized) LOD cloud that have never been carried out before.
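To make the role of the sameAs catalog (item ii above) more concrete, below is a minimal Python sketch of how the symmetric and transitive closure of owl:sameAs pairs could be computed with a union-find structure, so that every URI maps to a single equivalence-class representative. The class name SameAsCatalog, the example URIs, and the union-find approach itself are illustrative assumptions, not the paper's actual (or parallel) catalog construction algorithm.

from collections import defaultdict
from typing import Dict, List


class SameAsCatalog:
    """Maps each URI to the representative of its owl:sameAs equivalence class."""

    def __init__(self) -> None:
        self._parent: Dict[str, str] = {}

    def _find(self, uri: str) -> str:
        # Walk to the class representative, compressing the path on the way back.
        root = uri
        while self._parent.get(root, root) != root:
            root = self._parent[root]
        while self._parent.get(uri, uri) != root:
            self._parent[uri], uri = root, self._parent[uri]
        return root

    def add_pair(self, u: str, v: str) -> None:
        # Merging the two classes makes symmetry and transitivity hold by construction.
        self._parent.setdefault(u, u)
        self._parent.setdefault(v, v)
        ru, rv = self._find(u), self._find(v)
        if ru != rv:
            self._parent[ru] = rv

    def classes(self) -> List[List[str]]:
        # Group all seen URIs by their class representative.
        groups: Dict[str, List[str]] = defaultdict(list)
        for uri in self._parent:
            groups[self._find(uri)].append(uri)
        return list(groups.values())


if __name__ == "__main__":
    catalog = SameAsCatalog()
    # Hypothetical owl:sameAs pairs contributed by two different datasets.
    for u, v in [
        ("http://dbpedia.org/resource/Crete", "http://example.org/yago/Crete"),
        ("http://example.org/yago/Crete", "http://www.wikidata.org/entity/Q34374"),
    ]:
        catalog.add_pair(u, v)
    print(catalog.classes())  # one class containing all three URIs

Once every URI is resolved to a class representative, connectivity measurements such as the URI intersections of dataset subsets can presumably be computed over equivalence classes rather than raw URIs, which is the kind of information the semantics-aware element index builds on.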
