Conference proceedings: Trends and applications in knowledge discovery and data mining

A Simhash-Based Generalized Framework for Citation Matching in MapReduce

Abstract

Citation matching is the task of finding the cited papers from only a small amount of information. Several approaches to citation matching exist, but most require expensive model processing to achieve good results and may not be efficient enough for the millions of citations found in large digital libraries. To address this problem, we propose a simhash-based generalized framework for citation matching in MapReduce. The framework implements citation matching with exact title matching and distance-based short-text similarity metrics. To improve accuracy, it supports customizable citation fields, citation field weights, and word segmentation weights. We also design a heuristic algorithm that automatically calculates the weight of each citation field. To handle large-scale datasets, we implement the framework in Hadoop, a popular parallel computation platform. We run experiments on real datasets from a Chinese Medicine Digital Library, as well as a comparative experiment on the Cora corpus (McCallum's citation matching test set). The experimental results confirm the efficiency and effectiveness of our framework.
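
The abstract names the ingredients (exact title matching, simhash fingerprints, field weights, a distance-based comparison) but this record does not include the algorithm itself. The following is only a minimal single-machine sketch of how those pieces could fit together, not the authors' implementation. The 64-bit fingerprint width, the MD5-based token hash, the field names and weights in `FIELD_WEIGHTS`, and the `max_distance` threshold are all assumptions made for illustration; the MapReduce/Hadoop distribution and the heuristic weight calculation are not shown.

```python
import hashlib
import re

BITS = 64  # assumed fingerprint width

# Hypothetical field weights; the paper computes weights heuristically.
FIELD_WEIGHTS = {"title": 3.0, "authors": 2.0, "venue": 1.0, "year": 1.0}


def token_hash(token: str) -> int:
    """Map a token to a 64-bit integer (MD5 is an arbitrary choice here)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash(weighted_tokens) -> int:
    """Weighted simhash: each token votes +w or -w on every bit position."""
    counts = [0.0] * BITS
    for token, weight in weighted_tokens:
        h = token_hash(token)
        for i in range(BITS):
            counts[i] += weight if (h >> i) & 1 else -weight
    fingerprint = 0
    for i, total in enumerate(counts):
        if total > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def normalize(text: str) -> list:
    """Lowercase and split into word tokens, dropping punctuation."""
    return re.findall(r"\w+", text.lower())


def citation_tokens(citation: dict):
    """Yield (token, weight) pairs, weighting each token by its field."""
    for field, weight in FIELD_WEIGHTS.items():
        for token in normalize(citation.get(field, "")):
            yield token, weight


def same_paper(a: dict, b: dict, max_distance: int = 10) -> bool:
    """Exact title match first, then fall back to simhash distance."""
    title_a = normalize(a.get("title", ""))
    title_b = normalize(b.get("title", ""))
    if title_a and title_a == title_b:
        return True
    fp_a = simhash(citation_tokens(a))
    fp_b = simhash(citation_tokens(b))
    return hamming(fp_a, fp_b) <= max_distance
```

In a MapReduce setting, one plausible arrangement would be to compute fingerprints in the map phase and compare candidates that share a key in the reduce phase, but that partitioning is an assumption rather than something this record describes.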

Bibliographic record

  • Source
  • Conference location: Ho Chi Minh City (VN)
  • Author affiliations

    Beijing Key Laboratory of Intelligent Telecommunications Software and Multimedia, Beijing University of Posts and Telecommunications, Beijing 100876, China;

    Beijing Key Laboratory of Intelligent Telecommunications Software and Multimedia, Beijing University of Posts and Telecommunications, Beijing 100876, China;

    Beijing Key Laboratory of Intelligent Telecommunications Software and Multimedia, Beijing University of Posts and Telecommunications, Beijing 100876, China;

    Beijing Key Laboratory of Intelligent Telecommunications Software and Multimedia, Beijing University of Posts and Telecommunications, Beijing 100876, China;

    Beijing Key Laboratory of Intelligent Telecommunications Software and Multimedia, Beijing University of Posts and Telecommunications, Beijing 100876, China;

  • Conference organizer
  • Original format: PDF
  • Text language: eng
  • Chinese Library Classification
  • Keywords

    Citation matching; Parallelization; Short text similarity; MapReduce;

  • Date added: 2022-08-26 14:12:46
