METHOD AND SYSTEM FOR CATEGORIZING ARABIC TEXT

Abstract

The present invention is directed to a system, method and computer program for categorizing Arabic documents based on their text content. More particularly, the invention is a frequency-based method using a learning approach that exploits Arabic lexical look-up, Arabic morphological analysis, and a number of interconnected Arabic linguistic filters to categorize Arabic texts. The method comprises two phases: a learning phase and an automatic categorization phase. During the learning phase, lemma forms (called stems) of specific noun types are extracted from manually categorized Arabic texts and then filtered using Arabic morphological analysis. Based on these lemma forms and on their normalized frequency for each predefined category, new Arabic texts can be assigned automatically to the predefined categories during the automatic categorization phase. As a result, categorization of Arabic texts is more precise and less sensitive to noise than with prior art solutions. The invention thus relates to a method for automatically assigning Arabic texts to predefined categories in support of information retrieval. For example, the method can be used to filter out Arabic documents that are unlikely to contain extractable data, and to route Arabic texts to category-specific processing mechanisms.
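
The two phases described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. The abstract does not state the scoring rule, so summing the normalized frequencies of a new text's lemmas is an assumption made here. The function names `extract_lemmas`, `learn` and `categorize` are hypothetical, and `extract_lemmas` is only a whitespace-splitting placeholder for the lexical look-up, morphological analysis and linguistic filters, which are not reproduced.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Optional, Tuple


def extract_lemmas(text: str) -> List[str]:
    """Hypothetical placeholder for the Arabic morphological analysis and
    linguistic filters: whitespace tokens stand in for filtered noun lemmas."""
    return text.split()


def learn(labeled_texts: Iterable[Tuple[str, str]]) -> Dict[str, Dict[str, float]]:
    """Learning phase: build, for each category, the normalized frequency of
    every lemma found in the manually categorized texts."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for category, text in labeled_texts:
        counts[category].update(extract_lemmas(text))
    profiles: Dict[str, Dict[str, float]] = {}
    for category, counter in counts.items():
        total = sum(counter.values())
        profiles[category] = (
            {lemma: n / total for lemma, n in counter.items()} if total else {}
        )
    return profiles


def categorize(text: str, profiles: Dict[str, Dict[str, float]]) -> Optional[str]:
    """Categorization phase (assumed scoring rule): pick the category whose
    learned lemma frequencies best match the lemmas of the new text.
    Returns None when no lemma of the text was seen during learning."""
    lemma_counts = Counter(extract_lemmas(text))
    best_category, best_score = None, 0.0
    for category, freqs in profiles.items():
        score = sum(freqs.get(lemma, 0.0) * n for lemma, n in lemma_counts.items())
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Toy training data; the category names and texts are illustrative only.
training = [
    ("sports", "كرة فريق مباراة كرة"),
    ("economy", "سوق بنك سوق أسهم"),
]
profiles = learn(training)
print(categorize("مباراة كرة", profiles))
```

Returning `None` for a text with no known lemmas loosely mirrors the filtering use mentioned in the abstract, where documents unlikely to contain extractable data are set aside instead of being forced into a category.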

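Bibliographic data

  • Publication number: IL173306D0
  • Publication date: 2006-06-11
  • Original format: PDF
  • Applicant/patentee: IBM CORPORATION
  • Application number: IL20060173306
  • Filing date: 2006-01-23
  • Classification: G06F; G06F17/30
  • Country: IL
  • Database entry time: 2022-08-21 21:39:05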
