Published in: 2014 IEEE 30th International Conference on Data Engineering Workshops

Scholarly big data information extraction and integration in the CiteSeerχ digital library

Abstract

CiteSeerχ is a digital library that contains approximately 3.5 million scholarly documents and receives between 2 and 4 million requests per day. In addition to making documents available via a public Website, the data is also used to facilitate research in areas like citation analysis, co-author network analysis, scalability evaluation and information extraction. The papers in CiteSeerχ are gathered from the Web by means of continuous automatic focused crawling and go through a series of automatic processing steps as part of the ingestion process. Given the size of the collection, the fact that it is constantly expanding, and the multiple ways in which it is used both by the public to access scholarly documents and for research, there are several big data challenges. In this paper, we provide a case study description of how we address these challenges when it comes to information extraction, data integration and entity linking in CiteSeerχ. We describe how we: aggregate data from multiple sources on the Web; store and manage data; process data as part of an automatic ingestion pipeline that includes automatic metadata and information extraction; perform document and citation clustering; perform entity linking and name disambiguation; and make our data and source code available to enable research and collaboration.