Journal: International journal of applied science & computations

N-Gram: a Method of Conflating Terms An Approach to Text Categorization and Question Answering Systems in the Arabic language


Abstract

Our main application program implements the N-Gram technique for question answering systems. The goal of the program is to find a paragraph in an Arabic document that can serve as an answer to a question. The implementation is written in Prolog. The overall idea is to couple an information retrieval system with a shallow approach to natural language processing. The essential first step in this task is the categorization of texts: the search must be directed only toward the relevant categories, such as science, medicine, social problems, society, history, and other vital categories. Our paper addresses this step, which must be handled as a separate task. We know that this task is already completed for a typical English corpus, for example in the TREC-8 context. We describe the categorization of documents in detail and also give an overview of advanced topics in this domain. The user asks a question in unstructured language but with a careful choice of words, since document categorization is based on word occurrence information. To process the user's question we rely mainly on N-grams, but to improve the chance of successful matches we first remove some known suffixes, numbers, English words, and other items, which are called stop-words. This removal process is called normalization. For simplicity we assume that the collection of target documents is identified ahead of time and stored as a text file. The remaining words of the question are further processed by the body of our program, which uses N-grams to compute the similarity between each word and the words of a paragraph in a selected document. Based on the similarity results, we may assign a value. Depending on the values for each word, a selected paragraph may be returned as an answer.
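The pipeline described in the abstract (normalize the question, compare words by N-grams, score a paragraph) can be illustrated with a short sketch. This is not the authors' Prolog program. It is a minimal Python illustration under stated assumptions: the abstract does not name the similarity measure, so the Dice coefficient over character bigrams is assumed here; the stop-word list, the averaging of per-word scores, the 0.5 threshold, and all function names are hypothetical. Suffix removal is omitted for brevity.

```python
import re

# Hypothetical, tiny stop-word list; the paper's actual list is not given.
STOP_WORDS = {"في", "من", "على", "هل", "ما"}


def char_ngrams(word, n=2):
    """Return the set of character n-grams of a word."""
    if len(word) < n:
        return {word} if word else set()
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def dice_similarity(a, b, n=2):
    """Dice coefficient over character n-gram sets (an assumed measure)."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))


def normalize(text):
    """Tokenize, then drop stop-words, numbers, and Latin-script words."""
    kept = []
    for tok in re.findall(r"\w+", text):
        if tok in STOP_WORDS:
            continue
        if re.search(r"[A-Za-z]|\d", tok):
            continue
        kept.append(tok)
    return kept


def score_paragraph(question, paragraph, n=2):
    """Average, over question words, of the best match in the paragraph."""
    q_words = normalize(question)
    p_words = normalize(paragraph)
    if not q_words or not p_words:
        return 0.0
    best = [max(dice_similarity(q, p, n) for p in p_words) for q in q_words]
    return sum(best) / len(best)


def best_paragraph(question, paragraphs, threshold=0.5):
    """Return the highest-scoring paragraph, or None if below threshold."""
    if not paragraphs:
        return None
    scored = [(score_paragraph(question, p), p) for p in paragraphs]
    score, para = max(scored, key=lambda t: t[0])
    return para if score >= threshold else None
```

Calling best_paragraph with a question string and a list of paragraph strings returns the paragraph whose words best match the question words, or None when no paragraph reaches the threshold. Because matching is done on character bigrams and not on whole words, inflected forms that share most of their letters still score highly, which is the conflation effect the title refers to.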