【Venue】Annual meeting of the Society for Computation in Linguistics

【Title】Unsupervised Learning of Cross-Lingual Symbol Embeddings Without Parallel Data

【Abstract】We present a new method for unsupervised learning of multilingual symbol (e.g. character) embeddings, without any parallel data or prior knowledge about correspondences between languages. It is able to exploit similarities across languages between the distributions over symbols' contexts of use within their language, even in the absence of any symbols in common to the two languages. In experiments with an artificially corrupted text corpus, we show that the method can retrieve character correspondences obscured by noise. We then present encouraging results of applying the method to real linguistic data, including for low-resourced languages. The learned representations open the possibility of fully unsupervised comparative studies of text or speech corpora in low-resourced languages with no prior knowledge regarding their symbol sets.

【Authors】Mark Granroth-Wilding; Hannu Toivonen

【Affiliations】University of Helsinki; University of Helsinki

【Year (Volume), Issue】2019(),

【Year】2019

【Pages】19-28

【Total pages】10

【Language of text】eng
