Journal: Natural Language Engineering

Compression versus traditional machine learning classifiers to detect code-switching in varieties and dialects: Arabic as a case study

Abstract

The occurrence of code-switching in online communication, when a writer switches among multiple languages, presents a challenge for natural language processing tools, since they are designed for texts written in a single language. To answer the challenge, this paper presents detailed research on ways to detect code-switching in Arabic text automatically. We compare the prediction by partial matching (PPM) compression-based classifier, implemented in Tawa, and a traditional machine learning classifier sequential minimal optimization (SMO), implemented in Waikato Environment for Knowledge Analysis, working specifically on Arabic text taken from Facebook. Three experiments were conducted in order to: (1) detect code-switching among the Egyptian dialect and English; (2) detect code-switching among the Egyptian dialect, the Saudi dialect, and English; and (3) detect code-switching among the Egyptian dialect, the Saudi dialect, Modern Standard Arabic (MSA), and English. Our experiments showed that PPM achieved a higher accuracy rate than SMO with 99.8% versus 97.5% in the first experiment and 97.8% versus 80.7% in the second. In the third experiment, PPM achieved a lower accuracy rate than SMO with 53.2% versus 60.2%. Code-switching between Egyptian Arabic and English text is easiest to detect because Arabic and English are generally written in different character sets. It is more difficult to distinguish between Arabic dialects and MSA as these use the same character set, and most users of Arabic, especially Saudis and Egyptians, frequently mix MSA with their dialects. We also note that the MSA corpus used for training the MSA model may not represent MSA Facebook text well, being built from news websites. This paper also describes in detail the new Arabic corpora created for this research and our experiments.
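The compression-based approach in the abstract can be illustrated with a small sketch: train one character model per language, then assign each piece of text to the model that would encode it in the fewest bits. The code below is not the Tawa PPM implementation. It swaps in a fixed order-2 character model with add-one smoothing as a simplified stand-in for PPM, and it labels whitespace-separated tokens rather than arbitrary spans. The class name `CharNgramModel`, the helpers `classify` and `label_tokens`, the `vocab_size` smoothing constant, and the two toy training sentences are all assumptions made for illustration; none of them come from the paper or its corpora.

```python
import math
from collections import defaultdict


class CharNgramModel:
    """Order-k character model with add-one smoothing.

    A simplified stand-in for PPM (no escape mechanism, fixed order).
    """

    def __init__(self, order=2, vocab_size=512):
        self.order = order
        self.vocab_size = vocab_size  # smoothing constant (assumed value)
        self.context_counts = defaultdict(int)
        self.ngram_counts = defaultdict(int)

    def train(self, text):
        # Pad with spaces so the first characters also have a context.
        padded = " " * self.order + text
        for i in range(self.order, len(padded)):
            context = padded[i - self.order:i]
            self.context_counts[context] += 1
            self.ngram_counts[(context, padded[i])] += 1

    def code_length_bits(self, text):
        # Sum of -log2 p(char | context): the cost of encoding `text`.
        padded = " " * self.order + text
        bits = 0.0
        for i in range(self.order, len(padded)):
            context = padded[i - self.order:i]
            seen = self.ngram_counts.get((context, padded[i]), 0)
            total = self.context_counts.get(context, 0)
            bits -= math.log2((seen + 1) / (total + self.vocab_size))
        return bits


def classify(text, models):
    """Return the label whose model encodes `text` in the fewest bits."""
    return min(models, key=lambda label: models[label].code_length_bits(text))


def label_tokens(text, models):
    """Label each whitespace-separated token with its best-fitting model."""
    return [(token, classify(token, models)) for token in text.split()]


# Toy training data, invented for this sketch only.
models = {"english": CharNgramModel(), "egyptian": CharNgramModel()}
models["english"].train(
    "the weather is nice today and i will see you at the party tonight"
)
models["egyptian"].train("انا رايح الشغل بكرة ان شاء الله")

print(label_tokens("see you بكرة", models))
```

With models this small, the separation between English and Arabic tokens comes almost entirely from the differing character sets, which matches the abstract's explanation of why the first experiment is the easiest. Telling two Arabic varieties apart would require far larger training corpora, since their character contexts overlap heavily.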
