IEICE Transactions on Information and Systems

A Machine Learning Approach for an Indonesian-English Cross Language Question Answering System



Abstract

We have built a cross-language question answering (CLQA) system for a source language with limited data resources (e.g., Indonesian) using a machine learning approach. The CLQA system consists of four modules: a question analyzer, a keyword translator, a passage retriever, and an answer finder. We use machine learning in two of them: the question classifier (part of the question analyzer) and the answer finder. The question classifier determines the expected answer type (EAT) of a question using a support vector machine (SVM). The features for classification are essentially the output of our shallow question parsing module. To improve the classification score, we also use statistical information extracted from our Indonesian corpus. The answer finder departs from the common approach, in which the answer is located by matching named entities in the word corpus against the EAT of the question; instead, we locate the answer by text chunking the word corpus. The features for the SVM-based text chunking process consist of question features, word corpus features, and similarity scores between the word corpus and the question keywords. In this way, we eliminate the named entity tagging process for the target documents. In the keyword translator, we use an Indonesian-English dictionary to translate Indonesian keywords into English, together with some simple patterns that transform borrowed English words. The translated keywords are then combined into boolean queries to retrieve relevant passages using IDF scores. We first conducted an experiment using 2,837 questions obtained from 18 Indonesian college students, with about 10% used as test data. We then conducted a similar experiment on the NTCIR (NII Test Collection for IR Systems) 2005 CLQA task by translating the English questions into Indonesian. Compared with the Japanese-English and Chinese-English CLQA results in NTCIR 2005, our system outperforms all others except one that relies on richer data resources, employing three dictionaries. Further, a rough comparison with two other Indonesian-English CLQA systems showed that our system achieved a higher accuracy score.
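
The keyword translation and passage retrieval steps described in the abstract could be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the toy dictionary, the two suffix rules for borrowed words, and the choice to treat each keyword as an OR group ranked by summed IDF are all assumptions made for the example, since the abstract does not specify them.

```python
import math
import re

# Hypothetical toy dictionary; the actual Indonesian-English dictionary
# used by the system is not given in the abstract.
TOY_DICTIONARY = {
    "presiden": ["president"],
    "pertama": ["first"],
}

# Illustrative suffix rules for borrowed English words (assumed for this
# sketch, not the authors' patterns).
BORROWED_WORD_PATTERNS = [
    (re.compile(r"tas$"), "ty"),   # universitas -> university
    (re.compile(r"si$"), "tion"),  # informasi -> information
]


def translate_keyword(word):
    """Return candidate English translations for one Indonesian keyword."""
    word = word.lower()
    if word in TOY_DICTIONARY:
        return list(TOY_DICTIONARY[word])
    for pattern, replacement in BORROWED_WORD_PATTERNS:
        if pattern.search(word):
            return [pattern.sub(replacement, word)]
    return [word]  # no entry and no rule applies: keep as-is (e.g. a name)


def tokenize(text):
    """Lowercase a passage and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def idf(term, passages_tokens):
    """Inverse document frequency of a term over the passage collection."""
    df = sum(1 for tokens in passages_tokens if term in tokens)
    return math.log(len(passages_tokens) / df) if df else 0.0


def retrieve(keywords, passages):
    """Return (score, passage) pairs for passages matching any keyword.

    Each keyword becomes an OR group of its translations; a passage's
    score is the sum, over matched groups, of the best IDF in the group.
    """
    passages_tokens = [tokenize(p) for p in passages]
    groups = [translate_keyword(k) for k in keywords]
    ranked = []
    for passage, tokens in zip(passages, passages_tokens):
        score = 0.0
        matched = False
        for group in groups:
            hits = [t for t in group if t in tokens]
            if hits:
                matched = True
                score += max(idf(t, passages_tokens) for t in hits)
        if matched:
            ranked.append((score, passage))
    ranked.sort(key=lambda pair: pair[0], reverse=True)
    return ranked
```

The dictionary is consulted before the suffix rules so that ordinary Indonesian words with a dictionary entry are never rewritten by a pattern. A real system would need a much narrower set of rules, since a bare suffix match also fires on native words that happen to end the same way.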
