Source: Journal of Informetrics

Monolingual and multilingual topic analysis using LDA and BERT embeddings

Abstract

Analyzing research topics offers insight into the direction of scientific development. In particular, analyzing multilingual research topics can help researchers grasp how topics evolve globally and can reveal topic similarity among scientific publications written in different languages. Most topic-analysis studies to date have been based on English-language publications and have relied heavily on citation-based topic evolution analysis. However, because it can be challenging for English publications to cite non-English sources, and because publications in many languages do not provide English translations of their abstracts, citation-based methodologies are not suitable for analyzing multilingual research topic relations. Since multilingual sentence embeddings can effectively preserve word semantics in multilingual translation tasks, a topic model based on such embeddings could potentially generate topic-word distributions for publications in multilingual analysis. In this paper, which is situated in the field of library and information science, we use multilingual pretrained Bidirectional Encoder Representations from Transformers (BERT) embeddings and the Latent Dirichlet Allocation (LDA) topic model to analyze topic evolution in monolingual and multilingual topic similarity settings. For each topic, we multiply its LDA probability value by the averaged tensor similarity of BERT embeddings to explore the evolution of the topic in scientific publications. Because the proposed method does not rely on a machine translator or on the author's subjective translation, it avoids the confusion and misuse caused by either machine error or the author's subjectively chosen English keywords. Our results show that the proposed approach is well suited to analyzing scientific evolution through monolingual and multilingual topic similarity relations. (C) 2020 Elsevier Ltd. All rights reserved.
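
The scoring step described in the abstract (an LDA topic probability multiplied by an averaged similarity of BERT embeddings) can be illustrated with a minimal sketch. The abstract does not give the exact formula, so everything here is an assumption: the similarity is taken to be the mean pairwise cosine similarity between two sets of embedding vectors, the function names `mean_cosine_similarity` and `weighted_topic_similarity` are hypothetical, and the small 2-dimensional arrays stand in for real multilingual BERT embeddings, which would come from a pretrained model.

```python
import numpy as np


def mean_cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Average pairwise cosine similarity between the rows of a and the rows of b.

    Assumption: this stands in for the "averaged tensor similarity" named in
    the abstract; the paper may define it differently.
    """
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    # Entry (i, j) of the product is the cosine between row i of a and row j of b.
    return float((a_unit @ b_unit.T).mean())


def weighted_topic_similarity(topic_prob: float,
                              emb_a: np.ndarray,
                              emb_b: np.ndarray) -> float:
    """Scale the averaged embedding similarity by the topic's LDA probability."""
    return topic_prob * mean_cosine_similarity(emb_a, emb_b)


# Toy stand-ins for the embeddings of the top words of one topic in two
# languages (hypothetical values, not real BERT output).
emb_lang_a = np.array([[1.0, 0.0], [0.0, 1.0]])
emb_lang_b = np.array([[1.0, 0.0], [1.0, 1.0]])

# Hypothetical LDA probability of the topic.
score = weighted_topic_similarity(0.4, emb_lang_a, emb_lang_b)
print(round(score, 4))
```

In a real pipeline the topic probability would come from a fitted LDA model and the vectors from a multilingual BERT encoder; the sketch only shows how the two quantities might be combined.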
