Journal: Data Technologies and Applications

Entity deduplication in big data graphs for scholarly communication



Abstract

Purpose: Several online services offer functionalities to access information from "big research graphs" (e.g. Google Scholar, OpenAIRE, Microsoft Academic Graph), which correlate scholarly/scientific communication entities such as publications, authors, datasets, organizations, projects and funders. Depending on the target users, access can range from searching and browsing content to consuming statistics for monitoring and providing feedback. Such graphs are populated over time as aggregations of multiple sources and therefore suffer from major entity-duplication problems. Although graph deduplication is a known and current problem, existing solutions are dedicated to specific scenarios, operate on flat collections or address local topology-driven challenges, and therefore cannot be re-used in other contexts.

Design/methodology/approach: This work presents GDup, an integrated, scalable, general-purpose system that can be customized to address deduplication over arbitrarily large information graphs. The paper presents its high-level architecture and its implementation as a service used within the OpenAIRE infrastructure system, and reports figures from real-case experiments.

Findings: GDup provides the functionalities required to deliver a fully-fledged entity deduplication workflow over a generic input graph. The system offers out-of-the-box Ground Truth management, acquisition of feedback from data curators, and algorithms for identifying and merging duplicates, to obtain a disambiguated output graph.

Originality/value: To our knowledge, GDup is the only system in the literature that offers an integrated and general-purpose solution for the deduplication of graphs while targeting big data scalability issues. GDup is today one of the key modules of the OpenAIRE infrastructure production system, which monitors Open Science trends on behalf of the European Commission, national funders and institutions.
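To make the "identify and merge duplicates" step more concrete, the following is a minimal, illustrative sketch of a generic graph deduplication pass. It is not GDup's implementation, which the abstract does not detail. The function names (`find`, `deduplicate`), the four-character blocking key, the `difflib` string similarity and the 0.9 threshold are all assumptions chosen for brevity. The sketch groups candidate nodes by a cheap key, compares pairs within each group, clusters matches with union-find, then collapses each cluster onto one representative and redirects its edges.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def find(parent, x):
    """Union-find lookup with path halving; returns the root of x's group."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def deduplicate(nodes, edges, threshold=0.9):
    """Hypothetical dedup pass (not GDup's actual algorithm).

    nodes: dict mapping node id -> title string
    edges: set of (source_id, target_id) tuples
    Returns (merged_nodes, merged_edges, representative).
    """
    # 1. Blocking: only compare nodes sharing a cheap key (assumed here:
    #    the first four characters of the lowercased title).
    blocks = defaultdict(list)
    for node_id, title in nodes.items():
        blocks[title.lower()[:4]].append(node_id)

    # 2. Pairwise matching inside each block; matches are unioned so that
    #    duplicate groups are closed transitively.
    parent = {node_id: node_id for node_id in nodes}
    for block in blocks.values():
        for a, b in combinations(sorted(block), 2):
            score = SequenceMatcher(None, nodes[a].lower(), nodes[b].lower()).ratio()
            if score >= threshold:
                root_a, root_b = find(parent, a), find(parent, b)
                if root_a != root_b:
                    # Keep the smallest id as the group representative.
                    parent[max(root_a, root_b)] = min(root_a, root_b)

    # 3. Merging: collapse each group onto its representative and redirect
    #    edges, dropping self-loops created by the merge.
    representative = {node_id: find(parent, node_id) for node_id in nodes}
    merged_nodes = {rep: nodes[rep] for rep in set(representative.values())}
    merged_edges = {
        (representative[s], representative[t])
        for s, t in edges
        if representative[s] != representative[t]
    }
    return merged_nodes, merged_edges, representative
```

Calling `deduplicate(nodes, edges)` on a small graph in which two publication nodes carry near-identical titles should return a graph where both map to one representative and edges from the same author collapse into a single edge. A production system would replace the blocking key and similarity function with configurable, entity-specific ones and would run the comparison step in parallel.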
