A Corpus-Based Approach for Automatic Thai Unknown Word Recognition Using Boosting Techniques

Jakkrit TECHO; Cholwich NATTEE; Thanaruk THEERAMUNKONG

Abstract

While classification techniques can be applied to automatic unknown word recognition in a language without word boundaries, they face the problem of unbalanced datasets, where positive unknown word candidates are far fewer than negative candidates. To solve this problem, this paper presents a corpus-based approach that introduces a so-called group-based ranking evaluation technique into ensemble learning in order to generate a sequence of classification models that later collaborate to select the most probable unknown word from multiple candidates. Given a classification model, the group-based ranking evaluation (GRE) is applied to construct a training dataset for learning the succeeding model, by weighting each of its candidates according to their ranks and correctness when the candidates of an unknown word are considered as one group. A number of experiments have been conducted on a large Thai medical text to evaluate the performance of the proposed group-based ranking evaluation approach, namely V-GRE, compared to the conventional naive Bayes classifier and our vanilla version without ensemble learning. As a result, the proposed method achieves an accuracy of 90.93±0.50% when the first rank is selected and 97.26±0.26% when the top-ten candidates are considered, that is, 8.45% and 6.79% improvement over the conventional record-based naive Bayes classifier and the vanilla version. Another result, obtained by applying only the best features, shows 93.93±0.22% and up to 98.85±0.15% accuracy for top-1 and top-10, respectively. These are 3.97% and 9.78% improvements over naive Bayes and the vanilla version. Finally, an error analysis is given.
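The group-based evaluation described in the abstract can be illustrated with a minimal Python sketch. It assumes each candidate is a tuple of a group identifier (one group per unknown word), a classifier score, and a correctness flag; this data layout and the function names `rank_groups`, `top_k_accuracy`, and `reweight` are hypothetical and chosen only for illustration. The `reweight` rule (raise the weight of a correct candidate that was not ranked first, together with the incorrect candidates ranked above it) is an assumed stand-in, since the abstract does not give the actual weighting formula.

```python
from collections import defaultdict


def rank_groups(candidates):
    """Group candidates by unknown-word id and sort each group by descending score.

    Each candidate is a tuple (group_id, score, is_correct).
    """
    groups = defaultdict(list)
    for group_id, score, is_correct in candidates:
        groups[group_id].append((score, is_correct))
    return {
        g: sorted(members, key=lambda m: m[0], reverse=True)
        for g, members in groups.items()
    }


def top_k_accuracy(candidates, k):
    """Fraction of groups whose correct candidate is among the k best-scored."""
    ranked = rank_groups(candidates)
    hits = sum(
        any(is_correct for _, is_correct in members[:k])
        for members in ranked.values()
    )
    return hits / len(ranked)


def reweight(candidates, boost=2.0):
    """Hypothetical weighting rule (an assumption, not the paper's formula).

    Within each group, if the correct candidate is not ranked first, the
    correct candidate and every candidate ranked above it get weight `boost`;
    all other candidates keep weight 1.0.
    """
    by_group = defaultdict(list)
    for idx, (group_id, _, _) in enumerate(candidates):
        by_group[group_id].append(idx)

    weights = [1.0] * len(candidates)
    for indices in by_group.values():
        order = sorted(indices, key=lambda i: candidates[i][1], reverse=True)
        correct_pos = next(
            (pos for pos, i in enumerate(order) if candidates[i][2]), None
        )
        if correct_pos is None or correct_pos == 0:
            continue  # no correct candidate, or it is already ranked first
        for i in order[: correct_pos + 1]:
            weights[i] = boost
    return weights


# Toy data: two unknown words ("w1", "w2") with three candidates each.
candidates = [
    ("w1", 0.9, True), ("w1", 0.4, False), ("w1", 0.1, False),
    ("w2", 0.8, False), ("w2", 0.6, True), ("w2", 0.2, False),
]

print(top_k_accuracy(candidates, 1))
print(top_k_accuracy(candidates, 2))
print(reweight(candidates))
```

In this toy data the correct candidate for "w2" is ranked second, so only that group is reweighted. In a boosting-style loop, such weights would be passed to the training of the succeeding model.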

Bibliographic details

Source
IEICE Transactions on Information and Systems, 2009, No. 12, pp. 2321-2333 (13 pages)
Authors
Jakkrit TECHO; Cholwich NATTEE; Thanaruk THEERAMUNKONG
Author affiliations

Information, Computer and Communication Technology School, Sirindhorn International Institute of Technology, Thammasat University, Thailand (all three authors)
Indexed in: Science Citation Index (SCI), USA; Engineering Index (EI), USA
Original format: PDF
Language: English
Keywords
unknown word recognition; word boundary detection; data mining; machine learning; ensemble learning
