Journal: Engineering Economics

Character-Based Machine Learning vs. Language Modeling for Diacritics Restoration

Abstract

In this research we compare two approaches to diacritics restoration, character-based machine learning and language modeling, and identify the better-performing one. The parameters of each approach (a wide variety of feature types for the character-based method and the value of n for the n-gram language-modeling method) were tuned to achieve the highest possible accuracy. Although the main focus is on the Lithuanian language, we posit that the findings can also be applied to other, similar (Latvian or Slavic) languages. In the experiments we measured the performance of both approaches on 10 domains (including normative texts and non-normative Internet comments). The best results, reaching ~99.5% and ~98.4% accuracy on characters and words, respectively, were achieved with the tri-gram language modeling method. It outperformed the character-based machine learning approach, used with its optimal composite feature set, by ~1.4% and ~3.8% accuracy on characters and words, respectively. DOI: http://dx.doi.org/10.5755/j01.itc.46.4.18066.
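
To make the n-gram idea concrete, here is a minimal Python sketch of word-level diacritics restoration. It is not the authors' implementation: the abstract does not describe the model in detail, so the greedy left-to-right decoding, the count-based back-off without smoothing, the function names (`strip_diacritics`, `train`, `restore`, `word_accuracy`) and the two toy Lithuanian training sentences are all assumptions made for illustration.

```python
import unicodedata
from collections import Counter, defaultdict


def strip_diacritics(text):
    """Remove combining marks, e.g. 'karštas' -> 'karstas'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def train(sentences, n=3):
    """Collect candidate forms and raw counts of word n-grams up to order n."""
    candidates = defaultdict(set)  # stripped word -> diacritized forms seen
    ngrams = Counter()             # word tuples of length 1..n -> count
    for sentence in sentences:
        words = sentence.lower().split()
        for i, word in enumerate(words):
            candidates[strip_diacritics(word)].add(word)
            for k in range(1, n + 1):
                if i - k + 1 >= 0:
                    ngrams[tuple(words[i - k + 1:i + 1])] += 1
    return candidates, ngrams


def restore(text, candidates, ngrams, n=3):
    """Greedily pick, for each word, the candidate with the best-supported history."""
    restored = []

    def score(option):
        # Back off from the longest usable history down to the unigram.
        for k in range(min(n, len(restored) + 1), 0, -1):
            history = tuple(restored[len(restored) - (k - 1):])
            count = ngrams[history + (option,)]
            if count:
                return (k, count)
        return (0, 0)

    for word in text.lower().split():
        # Unknown words are passed through unchanged.
        options = sorted(candidates.get(word, {word}))
        restored.append(max(options, key=score))
    return " ".join(restored)


def word_accuracy(predicted, reference):
    """Share of whitespace-separated words that match exactly."""
    pairs = list(zip(predicted.split(), reference.split()))
    return sum(p == r for p, r in pairs) / len(pairs)


# Toy corpus (hypothetical): 'karštas' (hot) vs. 'karstas' (coffin) is ambiguous
# once diacritics are removed, so the surrounding words must decide.
corpus = [
    "šiandien labai karštas oras",
    "medinis karstas buvo sunkus",
]
candidates, ngrams = train(corpus, n=3)
print(restore("siandien labai karstas oras", candidates, ngrams))
```

A real system would need a large training corpus, probability smoothing, and a search over whole sentences rather than a greedy pass; the sketch only shows how the word history can disambiguate candidates that collapse to the same undiacritized form.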
