Conference: International Conference on Language Resources and Evaluation

Habibi - a multi Dialect multi National Arabic Song Lyrics Corpus

Abstract

This paper introduces Habibi, the first Arabic song lyrics corpus. The corpus comprises more than 30,000 Arabic song lyrics in 6 Arabic dialects by singers from 18 different Arab countries. The lyrics are segmented into more than 500,000 sentences (song verses) containing more than 3.5 million words. I provide the corpus in both comma-separated value (csv) and annotated plain text (txt) file formats. In addition, I converted the csv version into JavaScript Object Notation (json) and Extensible Markup Language (xml) file formats. To experiment with the corpus, I run extensive binary and multi-class experiments for dialect and country-of-origin identification. The identification tasks use several classical machine learning and deep learning models with different word embeddings. For the binary dialect identification task, the best-performing classifier achieved a testing accuracy of 93%. This was achieved using a word-based Convolutional Neural Network (CNN) with a Continuous Bag of Words (CBOW) word embeddings model. Overall, the results show that all classical and deep learning models outperform the baseline, which demonstrates the suitability of the corpus for both dialect and country-of-origin identification tasks. I am making the corpus and the trained CBOW word embeddings freely available for research purposes.
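The csv-to-json and csv-to-xml conversion mentioned in the abstract could look roughly like the sketch below. It is only an illustration using the Python standard library: the column names (`song_id`, `singer`, `dialect`, `country`, `verse`), the XML tag names, and the placeholder rows are all assumptions, not the actual Habibi schema or data.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical header and placeholder rows; the real Habibi csv may differ.
SAMPLE_CSV = (
    "song_id,singer,dialect,country,verse\n"
    "1,Singer A,Dialect X,Country P,placeholder verse one\n"
    "1,Singer A,Dialect X,Country P,placeholder verse two\n"
)


def read_rows(csv_text):
    """Parse csv text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def rows_to_json(rows):
    """Serialise rows as a JSON array; ensure_ascii=False keeps Arabic text readable."""
    return json.dumps(rows, ensure_ascii=False, indent=2)


def rows_to_xml(rows, root_tag="corpus", row_tag="verse_record"):
    """Serialise rows as XML with one child element per csv column.

    Assumes every column name is a valid XML tag name.
    """
    root = ET.Element(root_tag)
    for row in rows:
        record = ET.SubElement(root, row_tag)
        for column, value in row.items():
            ET.SubElement(record, column).text = value
    return ET.tostring(root, encoding="unicode")


rows = read_rows(SAMPLE_CSV)
json_text = rows_to_json(rows)
xml_text = rows_to_xml(rows)
```

For a real file, the same functions would be applied to text read with `open(path, encoding="utf-8", newline="")`, since the lyrics are Arabic and need an explicit Unicode encoding.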
