ACM Journal of Data and Information Quality

Scalable Methods for Measuring the Connectivity and Quality of Large Numbers of Linked Datasets



Abstract

Although the ultimate objective of Linked Data is linking and integration, it is not currently evident how connected the current Linked Open Data (LOD) cloud is. In this article, we focus on methods, supported by special indexes and algorithms, for performing measurements related to the connectivity of more than two datasets that are useful in various tasks including (a) Dataset Discovery and Selection; (b) Object Coreference, i.e., for obtaining complete information about a set of entities, including provenance information; (c) Data Quality Assessment and Improvement, i.e., for assessing the connectivity between any set of datasets and monitoring their evolution over time, as well as for estimating data veracity; (d) Dataset Visualizations; and various other tasks. Since it would be prohibitively expensive to perform all these measurements in a naive way, we introduce indexes (and their construction algorithms) that can speed up such tasks. In brief, we introduce (i) a namespace-based prefix index, (ii) a sameAs catalog for computing the symmetric and transitive closure of the owl:sameAs relationships encountered in the datasets, (iii) a semantics-aware element index (that exploits the aforementioned indexes), and, finally, (iv) two lattice-based incremental algorithms for speeding up the computation of the intersection of URIs of any set of datasets. For enhancing scalability, we propose parallel index construction algorithms and parallel lattice-based incremental algorithms, we evaluate the achieved speedup using either a single machine or a cluster of machines, and we provide insights regarding the factors that affect efficiency. Finally, we report measurements about the connectivity of the (billion triples-sized) LOD cloud that have never been carried out before.
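To make the role of the sameAs catalog (item ii above) more concrete, below is a minimal Python sketch of how the symmetric and transitive closure of owl:sameAs pairs could be computed with a union-find structure, so that every URI maps to a single equivalence-class representative. The class name SameAsCatalog, the example URIs, and the union-find approach itself are illustrative assumptions, not the paper's actual (or parallel) catalog construction algorithm.

from collections import defaultdict
from typing import Dict, List


class SameAsCatalog:
    """Maps each URI to the representative of its owl:sameAs equivalence class."""

    def __init__(self) -> None:
        self._parent: Dict[str, str] = {}

    def _find(self, uri: str) -> str:
        # Walk to the class representative, compressing the path on the way back.
        root = uri
        while self._parent.get(root, root) != root:
            root = self._parent[root]
        while self._parent.get(uri, uri) != root:
            self._parent[uri], uri = root, self._parent[uri]
        return root

    def add_pair(self, u: str, v: str) -> None:
        # Merging the two classes makes symmetry and transitivity hold by construction.
        self._parent.setdefault(u, u)
        self._parent.setdefault(v, v)
        ru, rv = self._find(u), self._find(v)
        if ru != rv:
            self._parent[ru] = rv

    def classes(self) -> List[List[str]]:
        # Group all seen URIs by their class representative.
        groups: Dict[str, List[str]] = defaultdict(list)
        for uri in self._parent:
            groups[self._find(uri)].append(uri)
        return list(groups.values())


if __name__ == "__main__":
    catalog = SameAsCatalog()
    # Hypothetical owl:sameAs pairs contributed by two different datasets.
    for u, v in [
        ("http://dbpedia.org/resource/Crete", "http://example.org/yago/Crete"),
        ("http://example.org/yago/Crete", "http://www.wikidata.org/entity/Q34374"),
    ]:
        catalog.add_pair(u, v)
    print(catalog.classes())  # one class containing all three URIs

Once every URI is resolved to a class representative, connectivity measurements such as the URI intersections of dataset subsets can presumably be computed over equivalence classes rather than raw URIs, which is the kind of information the semantics-aware element index builds on.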
