Conference: Workshop on Scholarly Document Processing

Bootstrapping Multilingual Metadata Extraction: A Showcase in Cyrillic


Abstract

Applications based on scholarly data are of ever-increasing importance. This puts areas where high-quality data and compatible systems are not available, such as non-English publications, at a disadvantage. To help mitigate this imbalance, we use Cyrillic-script publications from the CORE collection to create a high-quality data set for metadata extraction. We use our data to train and evaluate sequence labeling models that extract title and author information. Retraining GROBID on our data, we observe significant improvements in precision and recall, and we achieve even better results with a self-developed model. We make our data set, covering over 15,000 publications, and our source code freely available.
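
As a rough illustration of the precision and recall figures mentioned in the abstract, the sketch below scores predicted token labels against gold labels for one field at a time. It is only a minimal example under stated assumptions: the abstract does not describe the evaluation protocol, so token-level matching, the function name `precision_recall`, and the label names `title`, `author`, and `other` are all hypothetical choices made here, not the authors' method.

```python
from collections import Counter


def precision_recall(gold, predicted, label):
    """Token-level precision and recall for a single label.

    Assumption: one gold label and one predicted label per token.
    The paper may evaluate at field or instance level instead.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    counts = Counter()
    for g, p in zip(gold, predicted):
        if p == label and g == label:
            counts["tp"] += 1  # correctly labeled token
        elif p == label:
            counts["fp"] += 1  # predicted the label, gold disagrees
        elif g == label:
            counts["fn"] += 1  # gold has the label, prediction missed it

    tp, fp, fn = counts["tp"], counts["fp"], counts["fn"]
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical labels for six tokens of a publication header.
gold = ["title", "title", "title", "author", "author", "other"]
predicted = ["title", "title", "other", "author", "author", "author"]

title_p, title_r = precision_recall(gold, predicted, "title")
author_p, author_r = precision_recall(gold, predicted, "author")
print(f"title:  precision={title_p:.2f} recall={title_r:.2f}")
print(f"author: precision={author_p:.2f} recall={author_r:.2f}")
```

In this toy input the title field loses recall because one title token is missed, while the author field loses precision because one non-author token is labeled as author.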
