A CiteSeerX-Based Dataset for Record Linkage and Metadata Extraction

Abstract

Data cleaning is an important problem in information science. Collecting data about the same entities from multiple sources, or with distinct methodologies, can yield slightly different, inconsistent data. The objective of data cleaning is to fuse the differing data into a single, cleaner dataset. In this paper we collect document metadata records from CiteSeerX and build a supervised record linker to Crossref. The supervised method is trained on a manually linked dataset containing 512 verified DOIs, which is, to our knowledge, the largest such dataset for bibliographic record linkage to date. We experiment with different supervised learning methods, and also show experimentally that the accuracy of the attached metadata records can improve the performance of automatic metadata extraction systems.
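As an illustration of what supervised record linkage between an extracted record and a reference record can look like, here is a minimal sketch. It is not the paper's method: the features (title similarity, author-token overlap, year agreement), the plain logistic regression, the function names, and the toy records are all assumptions made for this example, and no CiteSeerX or Crossref data or API is used.

```python
import math
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase the text and replace non-alphanumeric characters with spaces."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return " ".join(cleaned.split())


def pair_features(a, b):
    """Similarity features for a candidate pair of metadata records (hypothetical choice)."""
    title_sim = SequenceMatcher(None, normalize(a["title"]), normalize(b["title"])).ratio()
    names_a = set(normalize(" ".join(a["authors"])).split())
    names_b = set(normalize(" ".join(b["authors"])).split())
    union = names_a | names_b
    author_sim = len(names_a & names_b) / len(union) if union else 0.0  # Jaccard overlap
    year_match = 1.0 if a["year"] == b["year"] else 0.0
    return [title_sim, author_sim, year_match]


def predict_proba(weights, bias, features):
    """Probability that the two records describe the same document."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-score))


def train_logistic(X, y, epochs=2000, lr=0.5):
    """Logistic regression fitted with plain stochastic gradient descent."""
    weights, bias = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for features, label in zip(X, y):
            err = predict_proba(weights, bias, features) - label
            weights = [w - lr * err * x for w, x in zip(weights, features)]
            bias -= lr * err
    return weights, bias


# Toy, made-up records standing in for manually linked (label 1) and
# non-matching (label 0) pairs; real training data would be far larger.
r1 = {"title": "Record Linkage for Bibliographic Data", "authors": ["J. Smith", "A. Lee"], "year": 2015}
r1_ref = {"title": "Record linkage for bibliographic data.", "authors": ["John Smith", "Anna Lee"], "year": 2015}
r2 = {"title": "Metadata Extracton from Scientific PDFs", "authors": ["M. Kumar"], "year": 2012}
r2_ref = {"title": "Metadata Extraction from Scientific PDFs", "authors": ["M. Kumar"], "year": 2012}
r3 = {"title": "Deep Learning for Image Segmentation", "authors": ["K. Tanaka"], "year": 2015}

TRAIN_PAIRS = [(r1, r1_ref, 1), (r2, r2_ref, 1), (r1, r2_ref, 0), (r3, r1_ref, 0)]
X = [pair_features(a, b) for a, b, _ in TRAIN_PAIRS]
y = [label for _, _, label in TRAIN_PAIRS]
weights, bias = train_logistic(X, y)

# Score two unseen candidate pairs for the same query record.
query = {"title": "Supervised Learning for Citation Matching", "authors": ["R. Novak"], "year": 2010}
same = {"title": "Supervised learning for citation matching.", "authors": ["R. Novak"], "year": 2010}
other = {"title": "Deep Learning for Image Segmentation", "authors": ["K. Tanaka"], "year": 2010}
p_match = predict_proba(weights, bias, pair_features(query, same))
p_other = predict_proba(weights, bias, pair_features(query, other))
print(f"same document: {p_match:.3f}, different document: {p_other:.3f}")
```

In a setting like the one the abstract describes, the labeled pairs would come from the manually verified DOI links, and the classifier could be swapped for any other supervised learner without changing the feature step.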

Bibliographic details

Source: International Symposium on Symbolic and Numeric Algorithms for Scientific Computing, 2018, pp. 230-236 (7 pages)
Author: Zalán Bodó
Original format: PDF
Keywords: bibliographic systems; data acquisition; document handling; meta data; supervised learning

Similar documents

1. BRAZILIAN HEALTHCARE RECORD LINKAGE (BRHC-RLK) - A RECORD LINKAGE METHODOLOGY FOR BRAZILIAN MEDICAL CLAIMS DATASETS (DATASUS) [J]. Campos D. F., Rosim R. P., Duva A. S. Value in Health: The Journal of the International Society for Pharmacoeconomics and Outcomes Research. 2017, No. 5
2. Business datasets and record linkage: Correlates of linkage and estimating risks of non-linkage biases [J]. Jamie Moore, Gabriele Durrant, Peter W. Smith. International Journal of Population Data Science. 2017, No. 1
3. Evaluating privacy-preserving record linkage using cryptographic long-term keys and multibit trees on large medical datasets [J]. Adrian P. Brown, Christian Borgs, Sean M. Randall. BMC Medical Informatics and Decision Making. 2017, No. 1
4. A CiteSeerX-Based Dataset for Record Linkage and Metadata Extraction [C]. Zalán Bodó. International Symposium on Symbolic and Numeric Algorithms for Scientific Computing. 2018
5. Informing, evaluating and automating the record linkage process for reliably combining disparate datasets [D]. DuVall, Scott Leroy. 2010
6. A novel metadata management model to capture consent for record linkage in longitudinal research studies [O]. Christiana McMahon, Spiros Denaxas
7. Cleaning Noisy and Heterogeneous Metadata for Record Linking across Scholarly Big Datasets [O]. Athar Sefid, Jian Wu, Allen C. Ge. 2019
