Morphological Segmentation to Improve Crosslingual Word Embeddings for Low Resource Languages

Chimalamarri Santwana; Sitaram Dinkar; Jain Ashritha


Abstract

Crosslingual word embeddings developed from multiple parallel corpora help in understanding the relationships between languages and improving the prediction quality of machine translation. However, in low resource languages with complex and agglutinative morphologies, inducing good-quality crosslingual embeddings becomes challenging due to the problem of complex morphological forms and rare words. This is true even for languages that share common linguistic structure. In our work, we have shown that performing a simple morphological segmentation upon the corpora prior to the generation of crosslingual word embeddings for both roots and suffixes greatly improves the prediction quality and captures semantic similarities more effectively. To exhibit this, we have chosen two related languages: Telugu and Kannada of the Dravidian language family. We have also tested our method upon a widely spoken North Indian language, Hindi, belonging to the Indo-European language family, and have observed encouraging results.
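The abstract does not say how the segmentation is carried out, so the following is only a minimal sketch of the general idea: split each token into a root and a suffix before any embedding is trained, so that inflected forms of one root share a root token. The suffix list, the `+` marker, the `min_root` threshold, and the romanized toy tokens are all assumptions made for illustration and are not taken from the paper. Embedding training and the crosslingual mapping are not shown.

```python
from collections import Counter

# Hypothetical suffix inventory over made-up romanized tokens (not from the paper).
SUFFIXES = ("lu", "ki", "lo", "ni")


def segment(word, suffixes=SUFFIXES, min_root=3):
    """Split a word into [root, '+suffix'] by stripping the longest matching suffix."""
    for suf in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suf) and len(word) - len(suf) >= min_root:
            return [word[: -len(suf)], "+" + suf]
    return [word]  # no known suffix: keep the word whole


def segment_corpus(sentences):
    """Segment every whitespace-separated token of every sentence."""
    return [[piece for tok in s.split() for piece in segment(tok)] for s in sentences]


def vocab_counts(tokenized):
    """Count how often each token type occurs in a tokenized corpus."""
    return Counter(tok for sent in tokenized for tok in sent)


# Toy corpus: two roots, each seen with three different suffixes.
corpus = ["pustakalu intiki", "pustakaki intilo", "pustakalo intini"]
before = vocab_counts([s.split() for s in corpus])
after = vocab_counts(segment_corpus(corpus))
```

In this toy corpus every surface form occurs only once before segmentation, while after segmentation each root occurs three times. This is the rare-word effect the abstract points to: an embedding model trained on the segmented corpus sees more contexts per root and per suffix.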


Bibliographic details

Source
ACM Transactions on Asian Language Information Processing | 2020, Issue 5 | 69.1-69.15 | 15 pages
Authors
Chimalamarri Santwana; Sitaram Dinkar; Jain Ashritha
Author affiliations

PES Univ Ctr Cloud Comp & BigData, Banashankari 3rd Stage, 100 Ft Rd, Bangalore 560085, Karnataka, India (listed for each of the three authors)

Original format: PDF
Language: English
Keywords
Word embeddings; word2vec; crosslingual embeddings; machine translation; morphology; morphologically rich languages; bilingual embeddings; supervised learning; linear transformation;


Similar literature

1. Improving Word Embedding Coverage in Less-Resourced Languages Through Multi-Linguality and Cross-Linguality: A Case Study with Aspect-Based Sentiment Analysis [J]. Akhtar Md Shad, Sawant Palaash, Sen Sukanta. ACM Transactions on Asian Language Information Processing. 2019, Issue 2
2. Co-occurrence Weight Selection in Generation of Word Embeddings for Low Resource Languages [J]. Yucesoy Veysel, Koc Aykut. ACM Transactions on Asian Language Information Processing. 2019, Issue 3
3. Low Resource Keyword Search With Synthesized Crosslingual Exemplars [J]. Yusuf Bolaji, Gundogdu Batuhan, Saraclar Murat. IEEE/ACM Transactions on Audio, Speech, and Language Processing. 2019, Issue 7
4. Improving Named Entity Recognition for Morphologically Rich Languages Using Word Embeddings [C]. Demir Hakan, Ozgur Arzucan. International Conference on Machine Learning and Applications. 2014
5. Parallel Sentence Detection in Comparable Corpora with Bilingual Word Embeddings for Low-Resource Languages [D]. Cadigan, John. 2018
6. Improving Loanword Identification in Low-Resource Language with Data Augmentation and Multiple Feature Fusion [O]. Chenggang Mi, Shaolin Zhu, Rui Nie. 2021
7. Improved Acoustic Word Embeddings for Zero-Resource Languages Using Multilingual Transfer [O]. Herman Kamper, Yevgen Matusevych, Sharon Goldwater. 2021
