Journal: ACM Transactions on Asian Language Information Processing

CESS-A System to Categorize Bangla Web Text Documents

Abstract

Technology has evolved remarkably, leading to an exponential increase in the availability of digital text documents from disparate domains on the Internet. This makes information retrieval a highly time- and resource-consuming task. A system that can categorize such documents by domain can therefore help users obtain the required information with relative ease and also reduce the workload of search engines. This article presents a text categorization system (CESS) that categorizes text documents using newly proposed hybrid features that combine term frequency-inverse document frequency-inverse class frequency and modified chi-square methods. Experiments were performed on real-world Bangla documents from eight domains, comprising 24,29,857 tokens (2,429,857 in Western digit grouping), and the highest accuracy of 99.91% was obtained with multilayer perceptron-based classification. The system was also tested on the Reuters-21578 and 20Newsgroups datasets, obtaining accuracies of 97.29% and 94.67%, respectively, which shows the language-independent nature of the system.
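
The abstract names the two ingredients of the hybrid features but gives no formulas, so the following is only a rough sketch of the general idea, not the authors' method. It assumes a common textbook form of term frequency-inverse document frequency-inverse class frequency weighting (with smoothed logarithms) and substitutes the standard chi-square statistic for the paper's modified variant, which is not specified here. The function names `tf_idf_icf`, `chi_square`, and `select_terms`, and the choice to combine the two by selecting terms with chi-square and weighting them with TF-IDF-ICF, are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def tf_idf_icf(docs, labels):
    """Return one {term: weight} dict per document.

    docs   -- list of token lists
    labels -- class label of each document
    The smoothed log(1 + ...) form is an assumption, not the paper's formula.
    """
    n_docs = len(docs)
    n_classes = len(set(labels))
    df = Counter()                    # documents containing each term
    term_classes = defaultdict(set)   # classes in which each term occurs
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            df[term] += 1
            term_classes[term].add(label)

    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total)
            * math.log(1 + n_docs / df[term])
            * math.log(1 + n_classes / len(term_classes[term]))
            for term, count in counts.items()
        })
    return weights


def chi_square(docs, labels, term, cls):
    """Standard (unmodified) chi-square score of a term for one class."""
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        present = term in tokens
        in_cls = label == cls
        if present and in_cls:
            a += 1        # term present, document in class
        elif present:
            b += 1        # term present, document outside class
        elif in_cls:
            c += 1        # term absent, document in class
        else:
            d += 1        # term absent, document outside class
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def select_terms(docs, labels, k):
    """Keep the k terms with the highest chi-square score over all classes."""
    vocab = sorted({t for tokens in docs for t in tokens})
    classes = sorted(set(labels))
    score = {t: max(chi_square(docs, labels, t, c) for c in classes)
             for t in vocab}
    return sorted(vocab, key=lambda t: -score[t])[:k]
```

Under these assumptions, a document vector would hold the TF-IDF-ICF weights of the terms returned by `select_terms`, and those vectors could then be passed to a classifier such as a multilayer perceptron. A term that occurs in only one class receives a larger inverse class frequency factor than a term spread across every class, which is the intuition behind adding the class-level factor.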