Entity deduplication in big data graphs for scholarly communication

Manghi Paolo; Atzori Claudio; De Bonis Michele; Bardi Alessia


Abstract

Purpose: Several online services offer functionalities to access information from "big research graphs" (e.g. Google Scholar, OpenAIRE, Microsoft Academic Graph), which correlate scholarly/scientific communication entities such as publications, authors, datasets, organizations, projects and funders. Depending on the target users, access can range from searching and browsing content to consuming statistics for monitoring and providing feedback. Such graphs are populated over time as aggregations of multiple sources and therefore suffer from major entity-duplication problems. Although graph deduplication is a known and current problem, existing solutions are dedicated to specific scenarios, operate on flat collections, or address local topology-driven challenges, and therefore cannot be reused in other contexts.

Design/methodology/approach: This work presents GDup, an integrated, scalable, general-purpose system that can be customized to address deduplication over arbitrarily large information graphs. The paper presents its high-level architecture and its implementation as a service used within the OpenAIRE infrastructure system, and reports figures from real-case experiments.

Findings: GDup provides the functionalities required to deliver a fully-fledged entity deduplication workflow over a generic input graph. The system offers out-of-the-box Ground Truth management, acquisition of feedback from data curators, and algorithms for identifying and merging duplicates, to obtain a disambiguated output graph.

Originality/value: To our knowledge, GDup is the only system in the literature that offers an integrated and general-purpose solution for the deduplication of graphs while targeting big data scalability issues. GDup is today one of the key modules of the OpenAIRE infrastructure production system, which monitors Open Science trends on behalf of the European Commission, national funders and institutions.
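The abstract does not describe GDup's algorithms in detail, so the following is only a minimal, generic sketch of the "identify and merge duplicates" idea over a small graph. It is not GDup's implementation. The function names, the four-character blocking key, the title-similarity measure and the 0.9 threshold are all assumptions made for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def normalize(title):
    """Lowercase the title and keep only alphanumerics separated by single spaces."""
    cleaned = "".join(c if c.isalnum() else " " for c in title.lower())
    return " ".join(cleaned.split())


def find_duplicate_pairs(nodes, threshold=0.9):
    """Group nodes by a cheap key, then compare titles only within each group."""
    blocks = defaultdict(list)
    for node_id, title in nodes.items():
        # Hypothetical blocking key: first four characters of the normalized title.
        blocks[normalize(title)[:4]].append(node_id)
    pairs = []
    for ids in blocks.values():
        for a, b in combinations(sorted(ids), 2):
            score = SequenceMatcher(None, normalize(nodes[a]), normalize(nodes[b])).ratio()
            if score >= threshold:
                pairs.append((a, b))
    return pairs


def merge_duplicates(nodes, edges, pairs):
    """Collapse each duplicate group onto one representative and relink the edges."""
    parent = {n: n for n in nodes}

    def find(n):
        parent.setdefault(n, n)  # edge endpoints outside `nodes` map to themselves
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)  # smallest id becomes the representative

    merged_nodes = {n: t for n, t in nodes.items() if find(n) == n}
    merged_edges = {(find(s), find(t)) for s, t in edges if find(s) != find(t)}
    return merged_nodes, merged_edges


# Illustrative data: two records of the same publication and one unrelated record.
nodes = {
    "p1": "Entity deduplication in big data graphs",
    "p2": "Entity Deduplication in Big Data Graphs.",
    "p3": "Graph mining for strategy games",
}
edges = [("p1", "a1"), ("p2", "a1"), ("p3", "a2")]  # publication -> author links

pairs = find_duplicate_pairs(nodes)
merged_nodes, merged_edges = merge_duplicates(nodes, edges, pairs)
```

Restricting comparisons to records that share a blocking key avoids comparing every pair of nodes, which is one common way such workflows stay tractable on large graphs. The union-find step ensures that chains of pairwise matches end up in a single merged node.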


Bibliographic details

Source: Data technologies and applications, 2020, Issue 4, pp. 409-435 (27 pages)
Authors: Manghi Paolo; Atzori Claudio; De Bonis Michele; Bardi Alessia
Author affiliation: CNR, Ist Sci & Tecnol Informaz, Pisa, Italy
Original format: PDF
Language: English
Keywords: high-level architecture; big data; critical block; scalability; scholarly communication; information services


