A CiteSeerX-Based Dataset for Record Linkage and Metadata Extraction

Abstract

Data cleaning is an important problem in information science. Collecting data about the same entities from multiple sources, or with distinct methodologies, can yield slightly different, inconsistent records. The objective of data cleaning is to fuse the differing data into a single, cleaner dataset. In this paper we collect document metadata records from CiteSeerX and build a supervised record linker to Crossref. The supervised method is trained on a manually linked dataset containing 512 verified DOIs; to our knowledge, this is the largest such dataset for bibliographic record linkage to date. We experiment with different supervised learning methods, and also show experimentally that the accuracy of the attached metadata records can improve the performance of automatic metadata extraction systems.

Bibliographic record

Source
International Symposium on Symbolic and Numeric Algorithms for Scientific Computing | 2018 | 1 v. | 7 pages
Author
Zalán Bodó
Original format: PDF
Chinese Library Classification: Computing technology, computer technology
Keywords
bibliographic systems; data acquisition; document handling; meta data; supervised learning

Similar literature

1. BRAZILIAN HEALTHCARE RECORD LINKAGE (BRHC-RLK) - A RECORD LINKAGE METHODOLOGY FOR BRAZILIAN MEDICAL CLAIMS DATASETS (DATASUS) [J]. Campos D. F., Rosim R. P., Duva A. S. Value in Health: The Journal of the International Society for Pharmacoeconomics and Outcomes Research. 2017, No. 5
2. Business datasets and record linkage: Correlates of linkage and estimating risks of non-linkage biases. [J]. Jamie Moore, Gabriele Durrant, Peter W. Smith. International Journal of Population Data Science. 2017, No. 1
3. Evaluating privacy-preserving record linkage using cryptographic long-term keys and multibit trees on large medical datasets [J]. Adrian P. Brown, Christian Borgs, Sean M. Randall. BMC Medical Informatics and Decision Making. 2017, No. 1
4. A CiteSeerX-Based Dataset for Record Linkage and Metadata Extraction [C]. Zalán Bodó. International Symposium on Symbolic and Numeric Algorithms for Scientific Computing. 2018
5. Informing, evaluating and automating the record linkage process for reliably combining disparate datasets. [D]. DuVall, Scott Leroy. 2010
6. A novel metadata management model to capture consent for record linkage in longitudinal research studies [O]. Christiana McMahon, Spiros Denaxas
7. Cleaning Noisy and Heterogeneous Metadata for Record Linking across Scholarly Big Datasets [O]. Athar Sefid, Jian Wu, Allen C. Ge. 2019
