Published in: International Conference on Very Large Data Bases

Scalable Column Concept Determination for Web Tables Using Large Knowledge Bases

Abstract

Tabular data on the Web has become a rich source of structured data that ordinary users can usefully explore. Because of this potential, Web tables have recently attracted a number of studies that aim to understand their semantics and to provide effective search and exploration mechanisms over them. An important part of table understanding and search is column concept determination, i.e., identifying the most appropriate concepts associated with the columns of a table. The problem becomes especially challenging as increasingly rich knowledge bases, containing hundreds of millions of entities, become available. In this paper, we focus on an important instantiation of the column concept determination problem, in which the concepts of a column are determined by fuzzy matching its cell values to the entities within a large knowledge base. We provide an efficient MapReduce-based solution that scales with both the number of tables and the size of the knowledge base, and propose two novel techniques: knowledge concept aggregation and knowledge entity partition. We prove that finding the optimal aggregation strategy and finding the optimal partition strategy are both NP-hard, and propose efficient heuristic techniques that leverage the hierarchy of the knowledge base. Experimental results on real-world datasets show that our method achieves high annotation quality and performance, and scales well.
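
To make the core idea concrete, the following is a minimal single-machine sketch of determining column concepts by fuzzy matching cell values to knowledge base entities. It is an illustration only, not the MapReduce method of the paper: the toy knowledge base `TOY_KB`, the helper names `similarity` and `column_concepts`, the 0.8 threshold, and the use of Python's `difflib.SequenceMatcher` as the fuzzy matcher are all assumptions chosen for brevity.

```python
from collections import Counter
from difflib import SequenceMatcher

# Hypothetical toy knowledge base: entity name -> concepts it belongs to.
# A real knowledge base would hold hundreds of millions of entities.
TOY_KB = {
    "Barack Obama": ["Politician", "Person"],
    "Angela Merkel": ["Politician", "Person"],
    "Tom Hanks": ["Actor", "Person"],
    "Berlin": ["City"],
}


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1] (assumed fuzzy matcher)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def column_concepts(cells, kb, threshold=0.8):
    """Rank concepts for a column by counting fuzzy-matched cells per concept."""
    votes = Counter()
    for cell in cells:
        # Find the single best-matching entity for this cell value.
        best_entity, best_score = None, 0.0
        for entity in kb:
            score = similarity(cell, entity)
            if score > best_score:
                best_entity, best_score = entity, score
        # Only count the match if it is similar enough.
        if best_entity is not None and best_score >= threshold:
            votes.update(kb[best_entity])
    # Concepts sorted by how many cells support them, most supported first.
    return votes.most_common()


# Column with misspelled cell values, as is common in Web tables.
column = ["Barak Obama", "Angela Merkl", "Tom Hanks"]
print(column_concepts(column, TOY_KB))
# [('Person', 3), ('Politician', 2), ('Actor', 1)]
```

The sketch compares every cell against every entity, which is exactly the cost that becomes prohibitive at the scale the abstract describes. Note also that plain vote counting ranks the most general concept first, so choosing the "most appropriate" concept needs more than raw counts.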
