International Semantic Web Conference

BTC-2019: The 2019 Billion Triple Challenge Dataset

Abstract

Six datasets have been published under the title of Billion Triple Challenge (BTC) since 2008. Each such dataset contains billions of triples extracted from millions of documents crawled from hundreds of domains. While these datasets were originally motivated by the annual ISWC competition from which they take their name, they would become widely used in other contexts, forming a key resource for a variety of research works concerned with managing and/or analysing diverse, real-world RDF data as found natively on the Web. Given that the last BTC dataset was published in 2014, we prepare and publish a new version, BTC-2019, containing 2.2 billion quads parsed from 2.6 million documents on 394 pay-level domains. This paper first motivates the BTC datasets with a survey of research works using these datasets. Next, we provide details of how the BTC-2019 crawl was configured. We then present and discuss a variety of statistics that aim to give insights into the content of BTC-2019. Finally, we discuss the hosting of the dataset and the ways in which it can be accessed, remixed and used.
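The headline figures in the abstract (quads, documents, domains) are the kind of counts that can be gathered in one pass over quad-per-line data. The following is a minimal sketch and not the authors' tooling. It assumes N-Quads-style input in which every statement ends with an IRI graph label naming the source document, and it groups by full hostname as a rough stand-in for pay-level domain, since a true pay-level-domain count would need a public suffix list. The function name `summarise` and the sample data are hypothetical.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Matches the final IRI term before the terminating " ." of a statement.
# Assumption: every line is a quad, so that final term is the graph label.
LAST_IRI = re.compile(r"<([^<>\s]*)>\s*\.\s*$")


def summarise(lines):
    """Return (quad count, set of graph IRIs, Counter of quads per host)."""
    quads = 0
    documents = set()
    per_host = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        match = LAST_IRI.search(line)
        if match is None:
            continue  # no IRI graph label found; ignored in this sketch
        graph = match.group(1)
        quads += 1
        documents.add(graph)
        # Hostname is only an approximation of the pay-level domain.
        per_host[urlsplit(graph).hostname or ""] += 1
    return quads, documents, per_host


# Hypothetical sample data, not taken from BTC-2019.
SAMPLE = [
    '<http://a.example/x#me> <http://xmlns.com/foaf/0.1/name> "Ann" <http://a.example/x> .',
    '<http://a.example/x#me> <http://xmlns.com/foaf/0.1/knows> <http://b.example/y#me> <http://a.example/x> .',
    '<http://b.example/y#me> <http://xmlns.com/foaf/0.1/name> "Bob <b>" <http://www.b.example/y> .',
]
quads, documents, per_host = summarise(SAMPLE)
```

Because the function accepts any iterable of lines, it could be fed a lazily read file handle rather than an in-memory list, which matters at the scale of billions of statements.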