Journal: Applied Soft Computing

Smoothed n-gram based models for tweet language identification: A case study of the Brazilian and European Portuguese national varieties

Abstract

Identifying the language of a text is an important step in many natural language processing applications. State-of-the-art language identification (LID) systems perform very well when discriminating between unrelated languages on standard datasets. However, LID remains a bottleneck when discriminating between similar languages or language varieties, and it has also proven very challenging on short texts such as tweets. In this paper, we propose the use of smoothed n-gram language models to classify tweets as Brazilian or European Portuguese. Word and character n-gram language models were combined and evaluated with five different classifiers. We compared the smoothed n-gram language models with the Term Frequency and Inverse Document Frequency (TF-IDF) weighting scheme. We also propose an ensemble model in which the output class labels are combined using majority voting and algebraic combiners. The best configuration reached an accuracy of 92.71% using an ensemble model that combines Lidstone (0.1) character 6-gram, Good-Turing word unigram, and Witten-Bell word bigram models, together with the Log-Likelihood Ratio estimation method. (C) 2017 Published by Elsevier B.V.
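
To make the core idea concrete, the following is a minimal Python sketch of a Lidstone-smoothed character n-gram language model used to choose between two varieties. It is not the authors' implementation: the class name `LidstoneNgramModel`, the helper functions, the trigram order, and the four toy training sentences are all assumptions made for illustration, and the ensemble, the Good-Turing and Witten-Bell models, and the paper's Log-Likelihood Ratio method are not reproduced. Each model estimates P(c | h) = (count(h, c) + gamma) / (count(h) + gamma * V), and a text is assigned to the variety whose model gives it the higher log-probability.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Return the character n-grams of text, left-padded with spaces."""
    padded = " " * (n - 1) + text.lower()
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class LidstoneNgramModel:
    """Character n-gram model with Lidstone (additive) smoothing.

    Hypothetical helper class for illustration only.
    """

    def __init__(self, n=3, gamma=0.1):
        self.n = n
        self.gamma = gamma
        self.ngram_counts = Counter()    # count(h, c)
        self.context_counts = Counter()  # count(h)
        self.vocab = set()               # characters seen in training

    def fit(self, texts):
        for text in texts:
            for gram in char_ngrams(text, self.n):
                self.ngram_counts[gram] += 1
                self.context_counts[gram[:-1]] += 1
                self.vocab.add(gram[-1])
        return self

    def log_prob(self, text, vocab_size):
        """Sum of log P(c | h) over all n-grams of text."""
        total = 0.0
        for gram in char_ngrams(text, self.n):
            num = self.ngram_counts[gram] + self.gamma
            den = self.context_counts[gram[:-1]] + self.gamma * vocab_size
            total += math.log(num / den)
        return total


def classify(text, models):
    """Pick the label whose model assigns text the highest log-probability."""
    shared_vocab = set().union(*(m.vocab for m in models.values()))
    vocab_size = len(shared_vocab) + 1  # +1 reserves mass for unseen characters
    scores = {label: m.log_prob(text, vocab_size) for label, m in models.items()}
    return max(scores, key=scores.get), scores


# Toy training data, invented for this sketch (not taken from the paper).
training = {
    "pt-BR": ["você está fazendo o quê", "o ônibus está chegando"],
    "pt-PT": ["tu estás a fazer o quê", "o autocarro está a chegar"],
}
models = {
    label: LidstoneNgramModel(n=3, gamma=0.1).fit(texts)
    for label, texts in training.items()
}

label, scores = classify("o ônibus está chegando", models)
print(label, scores)
```

The difference between the two scores is a simple log-likelihood ratio between the varieties; whether this matches the estimation method used in the paper is not stated in the abstract. A realistic setup would train on a large tweet corpus and, following the abstract, use character 6-grams rather than trigrams.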
