Source: 2010 IEEE International Conference on Systems Man and Cybernetics

On a new model for automatic text categorization based on Vector Space Model



Abstract

In our previous paper, we proposed a new classification technique called the Frequency Ratio Accumulation Method (FRAM). It is a simple technique that adds up the ratios of term frequencies among categories, and it can use index terms without limit. We then improved FRAM by adopting character N-grams to form index terms. However, FRAM lacked a satisfactory mathematical basis. We therefore present a new mathematical model based on the Vector Space Model and consider its implications. The proposed method is evaluated in several experiments, in which we classify newspaper articles from the English Reuters-21578 data set and the Japanese CD-Mainichi 2002 data set. Reuters-21578 is a benchmark data set for automatic text categorization. The results show that FRAM has good classification accuracy: the micro-averaged F-measure of the proposed method is 92.2% for English. The proposed method performs classification with a single program and is language-independent.
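The abstract describes FRAM only in words, so the following is a minimal sketch of one plausible reading, not the paper's actual formulation. It assumes that the "ratio" of a term for a category is that term's count in the category divided by its count across all categories, and that a document is assigned to the category with the largest accumulated ratio. The function names (`char_ngrams`, `train`, `classify`), the N-gram length of 3, and the toy training data are all hypothetical.

```python
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def train(labeled_docs, n=3):
    """Build per-term frequency ratios from (text, category) pairs.

    Assumed definition (not taken from the paper):
    ratio[term][category] = count(term, category) / count(term, all categories)
    """
    counts = defaultdict(Counter)
    for text, category in labeled_docs:
        for term in char_ngrams(text, n):
            counts[term][category] += 1

    ratios = {}
    for term, per_category in counts.items():
        total = sum(per_category.values())
        ratios[term] = {cat: c / total for cat, c in per_category.items()}
    return ratios


def classify(text, ratios, n=3):
    """Accumulate the ratios of every known n-gram in `text` per category
    and return the highest-scoring category (None if no n-gram is known)."""
    scores = Counter()
    for term in char_ngrams(text, n):
        for category, ratio in ratios.get(term, {}).items():
            scores[category] += ratio
    if not scores:
        return None
    return max(scores, key=scores.get)


# Toy data, invented for illustration only.
training = [
    ("wheat harvest grain", "grain"),
    ("oil crude barrel", "crude"),
]
model = train(training)
label_a = classify("grain harvest", model)
label_b = classify("crude oil", model)
label_c = classify("zzz", model)
```

Because the index terms are raw character N-grams, the sketch needs no tokenizer or stop-word list, which is consistent with the abstract's claim that a single program can handle both English and Japanese text.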
