Journal: IEICE Transactions on Information and Systems

A Corpus-Based Approach for Automatic Thai Unknown Word Recognition Using Boosting Techniques



Abstract

While classification techniques can be applied to automatic unknown word recognition in a language without word boundaries, they face the problem of unbalanced datasets, where the number of positive unknown word candidates is far smaller than that of negative candidates. To solve this problem, this paper presents a corpus-based approach that introduces a so-called group-based ranking evaluation technique into ensemble learning in order to generate a sequence of classification models that later collaborate to select the most probable unknown word from multiple candidates. Given a classification model, the group-based ranking evaluation (GRE) is applied to construct a training dataset for learning the succeeding model, by weighting each of its candidates according to their ranks and correctness when the candidates of an unknown word are considered as one group. A number of experiments have been conducted on a large Thai medical text to evaluate the performance of the proposed group-based ranking evaluation approach, namely V-GRE, compared to the conventional naive Bayes classifier and our vanilla version without ensemble learning. As a result, the proposed method achieves an accuracy of 90.93±0.50% when the first rank is selected and 97.26±0.26% when the top-ten candidates are considered, that is, 8.45% and 6.79% improvement over the conventional record-based naive Bayes classifier and the vanilla version. Another result, obtained by applying only the best features, shows 93.93±0.22% and up to 98.85±0.15% accuracy for top-1 and top-10, respectively. These are 3.97% and 9.78% improvements over naive Bayes and the vanilla version. Finally, an error analysis is given.
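The group-based idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the actual GRE weighting formula, so everything below is an assumption made for illustration: the `Group` type, the function names, the `boost` factor, and the rule of up-weighting the correct candidate together with any negative ranked above it. Only the notions of grouping candidates per unknown word, ranking them, and measuring top-1 and top-k accuracy come from the text.

```python
from typing import List, Tuple

# One group = all candidates for a single unknown word,
# each stored as (classifier_score, is_correct). Hypothetical layout.
Group = List[Tuple[float, bool]]


def rank_of_correct(group: Group) -> int:
    """1-based rank of the correct candidate by descending score; 0 if none."""
    ordered = sorted(group, key=lambda c: c[0], reverse=True)
    for rank, (_, is_correct) in enumerate(ordered, start=1):
        if is_correct:
            return rank
    return 0


def top_k_accuracy(groups: List[Group], k: int) -> float:
    """Fraction of groups whose correct candidate is ranked within the top k."""
    hits = sum(1 for g in groups if 1 <= rank_of_correct(g) <= k)
    return hits / len(groups)


def reweight(group: Group, weights: List[float], boost: float = 2.0) -> List[float]:
    """Illustrative boosting-style update (NOT the paper's formula).

    If the correct candidate is not ranked first, multiply by `boost` the
    weight of the correct candidate and of every negative scored above it,
    then rescale so the group's total weight is unchanged.
    """
    if rank_of_correct(group) <= 1:
        # Already ranked first (or no correct candidate): leave weights alone.
        return list(weights)
    correct_score = max(score for score, ok in group if ok)
    raised = [
        w * boost if (ok or score > correct_score) else w
        for (score, ok), w in zip(group, weights)
    ]
    scale = sum(weights) / sum(raised)
    return [w * scale for w in raised]


if __name__ == "__main__":
    groups = [
        [(0.9, False), (0.8, True), (0.1, False)],  # correct candidate ranked 2nd
        [(0.7, True), (0.2, False)],                # correct candidate ranked 1st
    ]
    print(top_k_accuracy(groups, 1), top_k_accuracy(groups, 2))
    print(reweight(groups[0], [1.0, 1.0, 1.0]))
```

In a sequence of models, weights updated this way would feed the training of the next model, so later models concentrate on groups where the correct candidate was ranked too low. How the paper combines the models at prediction time is not stated in the abstract and is not modeled here.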
