Journal: ACM Transactions on Asian Language Information Processing

Co-occurrence Weight Selection in Generation of Word Embeddings for Low Resource Languages



Abstract

This study aims to improve the performance of word embeddings by proposing a new weighting scheme for co-occurrence counting. The idea behind this new family of weights is to compensate for the disadvantage that word pairs which are semantically close, but appear far apart in text, suffer in co-occurrence counting. For high-resource languages, this disadvantage may have little effect because such pairs still co-occur frequently. However, when available resources are scarce, such pairs are penalized for being distant. To favour them, a weighting scheme based on a polynomial fitting procedure is proposed that shifts the weights up for distant words while leaving the weights of nearby words almost unchanged. The parameter optimization for the new weights and the effects of the weighting scheme are analysed for English, Italian, and Turkish. A small portion of the English resources and a quarter of the Italian resources are used for demonstration purposes, as if these languages were low-resource languages. A performance increase is observed in analogy tests when the proposed weighting scheme is applied to relatively small corpora (i.e., mimicking low-resource languages) of both English and Italian. To confirm that the benefit is specific to small corpora, it is also shown that on a large English corpus the proposed weighting scheme does not outperform the original weights. Since Turkish is a relatively low-resource language, it is demonstrated that the proposed weighting scheme can increase performance in both analogy and similarity tests when all Turkish Wikipedia pages are used as the corpus. The positive effect of the proposed scheme is also demonstrated in a standard sentiment analysis task for Turkish.
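
The effect described in the abstract can be illustrated with a minimal sketch of distance-weighted co-occurrence counting. It rests on several assumptions not stated in the abstract: the baseline ("original") weight is taken to be 1/d for a pair d tokens apart, as in GloVe-style counting, and `make_shifted_weight`, with its `boost` and `power` parameters, is a hypothetical stand-in for the proposed family of weights, not the fitted polynomial from the paper. The toy sentence is likewise invented for illustration.

```python
from collections import defaultdict


def harmonic_weight(distance):
    # Assumed baseline: a pair `distance` tokens apart contributes 1/distance.
    return 1.0 / distance


def make_shifted_weight(window, boost=0.2, power=3):
    # Hypothetical stand-in for the proposed weights: 1/d plus a polynomial
    # term that is 0 at d = 1 and grows to `boost` at d = window, so nearby
    # pairs are almost unchanged and distant pairs are shifted up.
    def weight(distance):
        if window == 1:
            return 1.0
        t = (distance - 1) / (window - 1)
        return 1.0 / distance + boost * t ** power
    return weight


def cooccurrence_counts(tokens, window, weight_fn):
    # Accumulate symmetric weighted counts for every pair of tokens
    # that lie at most `window` positions apart.
    counts = defaultdict(float)
    for i, center in enumerate(tokens):
        for d in range(1, window + 1):
            j = i + d
            if j >= len(tokens):
                break
            w = weight_fn(d)
            counts[(center, tokens[j])] += w
            counts[(tokens[j], center)] += w
    return counts


# Toy sentence (invented): "capital" and "ankara" are related but 4 tokens apart.
tokens = "the capital of turkey is ankara".split()
window = 5
baseline = cooccurrence_counts(tokens, window, harmonic_weight)
shifted = cooccurrence_counts(tokens, window, make_shifted_weight(window))

for pair in [("capital", "of"), ("capital", "ankara")]:
    print(pair, baseline[pair], shifted[pair])
```

In this sketch the adjacent pair keeps its weight, while the distant pair receives a larger count than under 1/d. In a small corpus, where such a pair may co-occur only a handful of times, that shift is the mechanism the abstract relies on; how the actual polynomial is fitted and tuned is specified in the paper, not here.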
