Conference: IEEE International Conference on Data Engineering

Scholarly big data information extraction and integration in the CiteSeerχ digital library

Abstract

CiteSeerχ is a digital library that contains approximately 3.5 million scholarly documents and receives between 2 and 4 million requests per day. In addition to making documents available via a public Website, the data is also used to facilitate research in areas like citation analysis, co-author network analysis, scalability evaluation and information extraction. The papers in CiteSeerχ are gathered from the Web by means of continuous automatic focused crawling and go through a series of automatic processing steps as part of the ingestion process. Given the size of the collection, the fact that it is constantly expanding, and the multiple ways in which it is used both by the public to access scholarly documents and for research, there are several big data challenges. In this paper, we provide a case study description of how we address these challenges when it comes to information extraction, data integration and entity linking in CiteSeerχ. We describe how we: aggregate data from multiple sources on the Web; store and manage data; process data as part of an automatic ingestion pipeline that includes automatic metadata and information extraction; perform document and citation clustering; perform entity linking and name disambiguation; and make our data and source code available to enable research and collaboration.