
Unsupervised word embeddings capture latent knowledge from materials science literature


       

Abstract

The overwhelming majority of scientific knowledge is published as text, which is difficult to analyse by either traditional statistical analysis or modern machine learning methods. By contrast, the main source of machine-interpretable data for the materials research community has come from structured property databases(1,2), which encompass only a small fraction of the knowledge present in the research literature. Beyond property values, publications contain valuable knowledge regarding the connections and relationships between data items as interpreted by the authors. To improve the identification and use of this knowledge, several studies have focused on the retrieval of information from scientific literature using supervised natural language processing(3-10), which requires large hand-labelled datasets for training. Here we show that materials science knowledge present in the published literature can be efficiently encoded as information-dense word embeddings(11-13) (vector representations of words) without human labelling or supervision. Without any explicit insertion of chemical knowledge, these embeddings capture complex materials science concepts such as the underlying structure of the periodic table and structure-property relationships in materials. Furthermore, we demonstrate that an unsupervised method can recommend materials for functional applications several years before their discovery. This suggests that latent knowledge regarding future discoveries is to a large extent embedded in past publications. Our findings highlight the possibility of extracting knowledge and relationships from the massive body of scientific literature in a collective manner, and point towards a generalized approach to the mining of scientific literature.
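The abstract's central claim — that unsupervised embeddings place related materials concepts near one another in vector space, so that a material's proximity to an application keyword can serve as a recommendation score — can be illustrated with a minimal sketch. The vocabulary and the 3-dimensional vectors below are invented for illustration only; the paper trains real high-dimensional word2vec-style embeddings on millions of materials-science abstracts.

```python
from math import sqrt

# Toy "embeddings" -- hand-made for illustration, NOT values from the paper.
embeddings = {
    "LiCoO2":         [0.9, 0.8, 0.1],
    "battery":        [0.8, 0.9, 0.0],
    "thermoelectric": [0.1, 0.0, 0.9],
    "Bi2Te3":         [0.2, 0.1, 0.8],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_materials(query, candidates):
    """Rank candidate materials by embedding similarity to a query word."""
    q = embeddings[query]
    scored = [(name, cosine(embeddings[name], q)) for name in candidates]
    return sorted(scored, key=lambda t: t[1], reverse=True)

# Proximity to "thermoelectric" acts as a functional-application score:
ranking = rank_materials("thermoelectric", ["LiCoO2", "Bi2Te3"])
print(ranking[0][0])  # Bi2Te3 ranks above LiCoO2 for this query
```

In the paper's actual pipeline, the analogous operation is a nearest-neighbour query over embeddings learned from historical literature, which is what allows materials to be "recommended" for an application before their experimental discovery.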

Bibliographic record

  • Source
    Nature | 2019, Issue 7763 | pp. 95-98 | 4 pages
  • Author affiliations

    Lawrence Berkeley Natl Lab, Berkeley, CA 94720 USA|Google LLC, Mountain View, CA 94043 USA;

    Lawrence Berkeley Natl Lab, Berkeley, CA 94720 USA|Univ Calif Berkeley, Dept Mat Sci & Engn, Berkeley, CA 94720 USA;

    Lawrence Berkeley Natl Lab, Berkeley, CA 94720 USA;

    Lawrence Berkeley Natl Lab, Berkeley, CA 94720 USA|Univ Calif Berkeley, Dept Mat Sci & Engn, Berkeley, CA 94720 USA;

    Lawrence Berkeley Natl Lab, Berkeley, CA 94720 USA;

    Univ Calif Berkeley, Dept Mat Sci & Engn, Berkeley, CA 94720 USA;

    Lawrence Berkeley Natl Lab, Berkeley, CA 94720 USA|Univ Calif Berkeley, Dept Mat Sci & Engn, Berkeley, CA 94720 USA;

    Lawrence Berkeley Natl Lab, Berkeley, CA 94720 USA|Univ Calif Berkeley, Dept Mat Sci & Engn, Berkeley, CA 94720 USA;

    Lawrence Berkeley Natl Lab, Berkeley, CA 94720 USA;

  • Indexed in: Science Citation Index (SCI); Engineering Index (EI); MEDLINE; Chemical Abstracts (CA)
  • Original format: PDF
  • Language: English
  • Chinese Library Classification
  • Keywords

  • Date added to database: 2022-08-18 04:17:38
