Compression versus traditional machine learning classifiers to detect code-switching in varieties and dialects: Arabic as a case study

Taghreed Tarmom; William Teahan; Eric Atwell; Mohammad Ammar Alsalka

Abstract

Code-switching in online communication, where a writer switches among multiple languages, presents a challenge for natural language processing tools, since they are designed for texts written in a single language. To address this challenge, this paper presents detailed research on ways to detect code-switching in Arabic text automatically. We compare the prediction by partial matching (PPM) compression-based classifier, implemented in Tawa, with a traditional machine learning classifier, sequential minimal optimization (SMO), implemented in the Waikato Environment for Knowledge Analysis (WEKA), working specifically on Arabic text taken from Facebook. Three experiments were conducted to detect code-switching: (1) between the Egyptian dialect and English; (2) among the Egyptian dialect, the Saudi dialect, and English; and (3) among the Egyptian dialect, the Saudi dialect, Modern Standard Arabic (MSA), and English. PPM achieved higher accuracy than SMO in the first experiment (99.8% versus 97.5%) and in the second (97.8% versus 80.7%). In the third experiment, PPM achieved lower accuracy than SMO (53.2% versus 60.2%). Code-switching between Egyptian Arabic and English text is easiest to detect because Arabic and English are generally written in different character sets. It is more difficult to distinguish between Arabic dialects and MSA, as these use the same character set, and most users of Arabic, especially Saudis and Egyptians, frequently mix MSA with their dialects. We also note that the MSA corpus used for training the MSA model was built from news websites and so may not represent MSA Facebook text well. This paper also describes in detail the new Arabic corpora created for this research and our experiments.
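To illustrate the general idea behind compression-based classification, here is a minimal Python sketch. It does not use Tawa or PPM: as a stand-in, it trains one fixed-order character n-gram model per language with add-alpha smoothing, then assigns each token to the model that would encode it in the fewest bits. The class and function names, the smoothing parameters, and the toy training strings are all assumptions made for illustration. None of them come from the paper or its corpora.

```python
import math
from collections import Counter


class CharNgramModel:
    """Fixed-order character model with add-alpha smoothing.

    A simple stand-in for PPM, which blends several context orders;
    this sketch uses a single order only.
    """

    def __init__(self, text, order=2, alpha=0.1, alphabet_size=256):
        self.order = order
        self.alpha = alpha
        # Assumed symbol-set size, used only by the smoothing formula.
        self.alphabet_size = alphabet_size
        self.context_counts = Counter()
        self.ngram_counts = Counter()
        padded = " " * order + text
        for i in range(order, len(padded)):
            context = padded[i - order:i]
            self.context_counts[context] += 1
            self.ngram_counts[context + padded[i]] += 1

    def code_length(self, text):
        """Estimate how many bits this model needs to encode `text`."""
        padded = " " * self.order + text
        bits = 0.0
        for i in range(self.order, len(padded)):
            context = padded[i - self.order:i]
            seen = self.ngram_counts[context + padded[i]]
            total = self.context_counts[context]
            prob = (seen + self.alpha) / (total + self.alpha * self.alphabet_size)
            bits -= math.log2(prob)
        return bits


def classify(segment, models):
    """Return the label whose model encodes `segment` in the fewest bits."""
    return min(models, key=lambda label: models[label].code_length(segment))


def label_words(text, models):
    """Label each whitespace-separated token; a change of label marks a switch."""
    return [(word, classify(word, models)) for word in text.split()]


# Toy training strings (hypothetical, not taken from the paper's corpora).
models = {
    "english": CharNgramModel("the cat sat on the mat and the dog ran to the house"),
    "egyptian": CharNgramModel("انا عايز اروح البيت دلوقتي عشان تعبان"),
}

labels = label_words("the البيت", models)
```

With realistic training corpora, the same minimum-code-length rule could be applied to longer spans rather than single tokens. Telling English from Arabic is easy for such a model because the two scripts share almost no characters, whereas dialects and MSA share one script, so their models assign similar code lengths.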

Bibliographic details

Source
Natural Language Engineering, 2020, Issue 6, pp. 663-676 (14 pages)
Authors
Taghreed Tarmom; William Teahan; Eric Atwell; Mohammad Ammar Alsalka
Author affiliations

School of Computing, University of Leeds, Leeds, UK;

School of Computer Science and Electronic Engineering, Bangor University, Bangor, UK;

School of Computing, University of Leeds, Leeds, UK;

School of Computing, University of Leeds, Leeds, UK

Original format
PDF
Language
English
Keywords
Arabic; Corpus linguistics; Language resources; Machine learning; Sublanguages and controlled languages; Text segmentation

