Journal: Information Retrieval

Dynamic author name disambiguation for growing digital libraries



Abstract

When a digital library user searches for publications by an author name, she often sees a mixture of publications by different authors who share that name. As digital libraries grow and more authors are involved, this author ambiguity problem is becoming critical. Author disambiguation (AD) typically tries to solve it by leveraging metadata such as coauthors, research topics, publication venues and citation information, since more personal information such as contact details is often restricted or missing.

In this paper, we study the problem of how to efficiently disambiguate author names given an incessant stream of published papers. To this end, we propose a "BatchAD+IncAD" framework for dynamic author disambiguation. First, we perform batch author disambiguation (BatchAD) to disambiguate all author names at a given time by grouping all records (each record refers to a paper with one of its author names) into disjoint clusters. This establishes a one-to-one mapping between the clusters and real-world authors. Then, for newly added papers, we periodically perform incremental author disambiguation (IncAD), which determines whether each new record can be assigned to an existing cluster, or to a new cluster not yet included in the previous data. Based on the new data, IncAD also tries to correct previous AD results.

Our main contributions are: (1) We demonstrate with real data that a small number of new papers often have overlapping author names with a large portion of existing papers, so it is challenging for IncAD to effectively leverage previous AD results. (2) We propose a novel IncAD model which aggregates metadata from a cluster of records to estimate the author's profile, such as her coauthor distributions and keyword distributions, in order to predict how likely it is that a new record is "produced" by the author. (3) Using two labeled datasets and one large-scale raw dataset, we show that the proposed method is much more efficient than state-of-the-art methods while ensuring high accuracy.
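
The incremental step described in the abstract can be illustrated with a minimal sketch. It is not the paper's model: the abstract does not give the scoring function, so the additively smoothed log-likelihood, the `threshold` and `vocab_size` values, and all names (`Record`, `Cluster`, `assign`) are assumptions chosen for illustration. The sketch also omits the correction of earlier AD results that IncAD is said to attempt.

```python
import math
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Record:
    """One (paper, author name) pair, following the abstract's notion of a record."""
    name: str
    coauthors: list
    keywords: list


@dataclass
class Cluster:
    """Hypothetical author profile: counts aggregated over the cluster's records."""
    coauthors: Counter = field(default_factory=Counter)
    keywords: Counter = field(default_factory=Counter)
    size: int = 0

    def add(self, rec):
        self.coauthors.update(rec.coauthors)
        self.keywords.update(rec.keywords)
        self.size += 1


def avg_log_prob(items, counts, vocab_size, alpha=0.1):
    """Mean log-probability of items under an additively smoothed distribution."""
    if not items:
        return 0.0
    total = sum(counts.values()) + alpha * vocab_size
    return sum(math.log((counts[x] + alpha) / total) for x in items) / len(items)


def score(rec, cluster, vocab_size):
    """Assumed scoring rule: coauthor and keyword log-likelihoods, added together."""
    return (avg_log_prob(rec.coauthors, cluster.coauthors, vocab_size)
            + avg_log_prob(rec.keywords, cluster.keywords, vocab_size))


def assign(rec, clusters, vocab_size=1000, threshold=-12.0):
    """Attach rec to the best-scoring cluster for its name, or open a new cluster."""
    best, best_score = None, -math.inf
    for cluster in clusters:
        s = score(rec, cluster, vocab_size)
        if s > best_score:
            best, best_score = cluster, s
    if best is None or best_score < threshold:
        best = Cluster()          # no existing profile explains the record well enough
        clusters.append(best)
    best.add(rec)                 # the profile is updated incrementally
    return best
```

Here `clusters` stands for the existing clusters that share the new record's author name, as a batch step would have produced them. Scoring a record against aggregated cluster profiles, instead of against every earlier record, keeps the cost of each assignment proportional to the number of candidate clusters; the threshold value is arbitrary and would need tuning on labeled data.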
