Published in: The VLDB Journal

Large-scale linked data integration using probabilistic reasoning and crowdsourcing


Abstract

We tackle the problems of semiautomatically matching linked data sets and of linking large collections of Web pages to linked data. Our system, ZenCrowd, (1) uses a three-stage blocking technique in order to obtain the best possible instance matches while minimizing both computational complexity and latency, and (2) identifies entities from natural language text using state-of-the-art techniques and automatically connects them to the linked open data cloud. First, we use structured inverted indices to quickly find potential candidate results from entities that have been indexed in our system. Our system then analyzes the candidate matches and refines them whenever deemed necessary using computationally more expensive queries on a graph database. Finally, we resort to human computation by dynamically generating crowdsourcing tasks in case the algorithmic components fail to come up with convincing results. We integrate all results from the inverted indices, from the graph database and from the crowd using a probabilistic framework in order to make sensible decisions about candidate matches and to identify unreliable human workers. In the following, we give an overview of the architecture of our system and describe in detail our novel three-stage blocking technique and our probabilistic decision framework. We also report on a series of experimental results on a standard data set, showing that our system can achieve a 95 % average accuracy on instance matching (as compared to the initial 88 % average accuracy of the purely automatic baseline) while drastically limiting the amount of work performed by the crowd. The experimental evaluation of our system on the entity linking task shows an average relative improvement of 14 % over our best automatic approach.
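The escalation described above (cheap index lookup first, a costlier refinement next, the crowd only for cases that remain uncertain) can be illustrated with a minimal sketch. The abstract gives no formulas, so everything concrete here is an assumption made for illustration: the Jaccard token score, the `low`/`high` thresholds, the naive-Bayes-style vote combination, and all function names are hypothetical and are not taken from ZenCrowd itself. Worker reliabilities are treated as given inputs, whereas the system described above also estimates them.

```python
from collections import defaultdict


def build_inverted_index(entities):
    """Map each lowercase label token to the set of entity URIs containing it."""
    index = defaultdict(set)
    for uri, label in entities.items():
        for token in label.lower().split():
            index[token].add(uri)
    return index


def index_candidates(index, entities, mention):
    """Stage 1 (assumed scoring): Jaccard overlap between mention and label tokens."""
    mention_tokens = set(mention.lower().split())
    candidates = set()
    for token in mention_tokens:
        candidates |= index.get(token, set())
    scores = {}
    for uri in candidates:
        label_tokens = set(entities[uri].lower().split())
        shared = len(mention_tokens & label_tokens)
        scores[uri] = shared / len(mention_tokens | label_tokens)
    return scores


def _clamp(p, eps=1e-6):
    """Keep probabilities away from 0 and 1 so the odds stay finite."""
    return min(max(p, eps), 1.0 - eps)


def combine_votes(prior, votes, reliability):
    """Stage 3 (assumed model): multiply prior odds by each worker's likelihood ratio.

    votes is a list of (worker_id, says_match) pairs; reliability[worker_id]
    is the assumed probability that the worker answers correctly.
    """
    p = _clamp(prior)
    odds = p / (1.0 - p)
    for worker, says_match in votes:
        r = _clamp(reliability[worker])
        odds *= r / (1.0 - r) if says_match else (1.0 - r) / r
    return odds / (1.0 + odds)


def link(mention, entities, index, refine, ask_crowd, reliability,
         low=0.3, high=0.8):
    """Escalate only uncertain candidates: index -> refine -> crowd."""
    accepted = []
    for uri, p in index_candidates(index, entities, mention).items():
        if low < p < high:
            # Stage 2: hypothetical callback standing in for a graph-database query.
            p = refine(mention, uri, p)
        if low < p < high:
            # Stage 3: hypothetical callback that returns crowd votes.
            p = combine_votes(p, ask_crowd(mention, uri), reliability)
        if p >= high:
            accepted.append(uri)
    return sorted(accepted)
```

In this sketch a candidate whose index score is already outside the uncertain band never reaches the later stages, which mirrors the stated goal of limiting the work sent to the crowd. A real system would replace `refine` and `ask_crowd` with actual graph queries and task generation.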
