Scholarly big data information extraction and integration in the CiteSeer^χ digital library

Abstract

CiteSeer^χ is a digital library that contains approximately 3.5 million scholarly documents and receives between 2 and 4 million requests per day. In addition to making documents available via a public Website, the data is also used to facilitate research in areas like citation analysis, co-author network analysis, scalability evaluation and information extraction. The papers in CiteSeer^χ are gathered from the Web by means of continuous automatic focused crawling and go through a series of automatic processing steps as part of the ingestion process. Given the size of the collection, the fact that it is constantly expanding, and the multiple ways in which it is used both by the public to access scholarly documents and for research, there are several big data challenges. In this paper, we provide a case study description of how we address these challenges when it comes to information extraction, data integration and entity linking in CiteSeer^χ. We describe how we: aggregate data from multiple sources on the Web; store and manage data; process data as part of an automatic ingestion pipeline that includes automatic metadata and information extraction; perform document and citation clustering; perform entity linking and name disambiguation; and make our data and source code available to enable research and collaboration.

Bibliographic details

Source
2014 IEEE 30th International Conference on Data Engineering Workshops | 2014 | pp. 68-73 | 6 pages
Conference location: Chicago, IL (US)
Authors
Williams Kyle; Wu Jian; Choudhury Sagnik Ray; Khabsa Madian;
Author affiliation

Information Sciences and Technology, Pennsylvania State University, University Park, PA 16802, USA
Original format: PDF
Language: English (eng)

