Automatic Kurdish Sorani text categorization using N-gram based model

Abstract

N-gram based models for text categorization have been applied to many languages, in particular those of the Indo-European family. However, few studies have applied this model to the Kurdish Sorani language. This paper presents the results of investigating the N-gram frequency statistics technique for classifying Kurdish Sorani Unicode documents from online newspapers into their classes. The technique generates frequency profiles for the training and test documents, using word-level N-grams (1-grams) and character-level N-grams (2-, 3-, 4-, 5-, 6-, 7-, and 8-grams) as the text representation. A similarity measure, the Dice measure of similarity, is then used to classify the documents. The results show that character-level 5-grams give a better text representation, which leads to better text classification.
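
The pipeline described in the abstract (build N-gram frequency profiles, then compare them with a Dice similarity) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give the exact Dice formula or the profile size, so the sketch assumes the common set-based form 2|A ∩ B| / (|A| + |B|) computed over the most frequent character N-grams. The names `char_ngrams`, `profile`, `dice`, `classify` and the `top_k` parameter are hypothetical.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of the text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def profile(text, n=5, top_k=300):
    """Frequency profile: the set of the top_k most frequent n-grams.

    top_k is an assumed cut-off; the abstract does not state one.
    """
    counts = Counter(char_ngrams(text, n))
    return {gram for gram, _ in counts.most_common(top_k)}


def dice(a, b):
    """Set-based Dice coefficient: 2|A & B| / (|A| + |B|) (assumed form)."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def classify(document, class_profiles, n=5, top_k=300):
    """Return the label whose profile is most similar to the document's."""
    doc_profile = profile(document, n, top_k)
    return max(
        class_profiles,
        key=lambda label: dice(doc_profile, class_profiles[label]),
    )
```

To use it, build one profile per class from the concatenated training documents of that class, using the same `n` as at classification time, and pass the resulting dictionary to `classify`. Python strings are sequences of Unicode code points, so the same code applies to Sorani text, although any normalization of the script is left out of this sketch.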
