Conference: Workshop on Computational Approaches to Code Switching

Evaluating Word Embeddings for Indonesian-English Code-Mixed Text Based on Synthetic Data

Abstract

Code-mixed texts are abundant, especially in social media, and pose a problem for NLP tools, which are typically trained on monolingual corpora. In this paper, we explore and evaluate different types of word embeddings for Indonesian-English code-mixed text. We propose the use of code-mixed embeddings, i.e., embeddings trained on code-mixed text. Because large corpora of code-mixed text are required to train embeddings, we describe a method for synthesizing a code-mixed corpus, grounded in the literature and a survey. Using sentiment analysis as a case study, we show that code-mixed embeddings trained on synthesized data are at least as good as cross-lingual embeddings and better than monolingual embeddings.
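The abstract does not spell out the synthesis procedure, so the following is only a minimal sketch of one simple way such a corpus might be produced: random word-level substitution from a bilingual lexicon. It is not the authors' method. The lexicon `TOY_LEXICON`, the functions `synthesize_code_mixed` and `build_corpus`, and the `switch_prob` parameter are all hypothetical names introduced for illustration.

```python
import random

# Hypothetical toy Indonesian -> English lexicon (illustrative only,
# not taken from the paper).
TOY_LEXICON = {
    "saya": "I",
    "suka": "like",
    "film": "movie",
    "ini": "this",
    "bagus": "good",
}


def synthesize_code_mixed(tokens, lexicon, switch_prob, rng):
    """Swap each token that has a lexicon entry for its English
    counterpart with probability switch_prob; keep other tokens as-is."""
    mixed = []
    for tok in tokens:
        translation = lexicon.get(tok.lower())
        if translation is not None and rng.random() < switch_prob:
            mixed.append(translation)
        else:
            mixed.append(tok)
    return mixed


def build_corpus(sentences, lexicon, switch_prob=0.3, seed=0):
    """Return a list of token lists, one synthetic sentence per input.
    A fixed seed makes the output reproducible."""
    rng = random.Random(seed)
    return [
        synthesize_code_mixed(s.split(), lexicon, switch_prob, rng)
        for s in sentences
    ]


if __name__ == "__main__":
    monolingual = ["saya suka film ini", "film ini bagus"]
    for sent in build_corpus(monolingual, TOY_LEXICON):
        print(" ".join(sent))
```

The resulting token lists could then be passed to any standard embedding trainer. A realistic pipeline would presumably constrain where switches occur rather than substituting words uniformly at random, which is the kind of detail the paper grounds in the literature and a survey.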
