Journal: Computer Science and Information Systems

Hierarchical vs. flat n-gram-based text categorization: can we do better?

Abstract

Hierarchical text categorization (HTC) is the task of assigning a text document to one or more of the most suitable categories in a hierarchical category space. In this paper we present two HTC techniques that use the kNN and SVM machine learning methods for categorization and a byte n-gram-based document representation. Both are fully language independent and require neither text preprocessing nor prior information about document content or language. The effectiveness and language independence of the presented techniques are demonstrated in experiments on five tree-structured benchmark category hierarchies that differ in many respects: Reuters-Hier1, Reuters-Hier2, 15NGHier, and 20NGHier in English, and TanCorpHier in Chinese. The results are compared with those of the corresponding flat categorization techniques applied to the leaf-level categories of the same hierarchies. kNN-based flat text categorization produced slightly better results than kNN-based HTC on the largest datasets, TanCorpHier and 20NGHier. SVM-based HTC results do not differ considerably from those of the corresponding flat techniques, owing to the shallow hierarchies; still, they outperform both kNN-based flat and hierarchical categorization on all corpora except the smallest datasets, Reuters-Hier1 and Reuters-Hier2. Formal evaluation confirmed that the proposed techniques obtained state-of-the-art results.
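To make the idea concrete, the following is a minimal sketch of a byte n-gram document representation combined with a top-down kNN categorizer. It is not the authors' implementation: the abstract does not state the n-gram length, the similarity measure, or the value of k, so the trigram default, the cosine similarity, k = 1, and the toy category tree and training texts below are all assumptions chosen for illustration. All function and variable names are hypothetical.

```python
from collections import Counter
from math import sqrt


def byte_ngram_profile(text: str, n: int = 3) -> Counter:
    """Count overlapping byte n-grams of the UTF-8 encoding of `text`.

    Working on bytes means no tokenization or language detection is needed.
    """
    data = text.encode("utf-8")
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def cosine(p: Counter, q: Counter) -> float:
    """Cosine similarity between two n-gram count profiles (assumed measure)."""
    dot = sum(count * q[gram] for gram, count in p.items() if gram in q)
    norm_p = sqrt(sum(c * c for c in p.values()))
    norm_q = sqrt(sum(c * c for c in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


def knn_predict(profile: Counter, training: list, k: int = 1) -> str:
    """Majority label among the k training profiles most similar to `profile`."""
    ranked = sorted(training, key=lambda item: cosine(profile, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def leaves_under(node: str, tree: dict) -> set:
    """Return the set of leaf categories in the subtree rooted at `node`."""
    children = tree.get(node)
    if not children:
        return {node}
    result = set()
    for child in children:
        result |= leaves_under(child, tree)
    return result


def hierarchical_predict(text, tree, training, root="root", n=3, k=1):
    """Top-down categorization: pick one child per level until a leaf is reached.

    `tree` maps each internal node to its children; `training` is a list of
    (text, leaf_label) pairs. At each node, training documents are relabeled
    with the child whose subtree contains their leaf category.
    """
    profile = byte_ngram_profile(text, n)
    node = root
    while tree.get(node):
        local = []
        for child in tree[node]:
            leaves = leaves_under(child, tree)
            local += [(byte_ngram_profile(t, n), child)
                      for t, leaf in training if leaf in leaves]
        node = knn_predict(profile, local, k)
    return node


# Hypothetical two-level hierarchy and toy training set (not from the paper).
TREE = {
    "root": ["sports", "science"],
    "sports": ["football", "tennis"],
    "science": ["astronomy", "chemistry"],
}
TRAINING = [
    ("the striker scored a goal in the football match", "football"),
    ("she won the tennis match with a strong serve", "tennis"),
    ("the telescope observed a distant galaxy", "astronomy"),
    ("the enzyme catalyzes the chemical reaction", "chemistry"),
]

if __name__ == "__main__":
    print(hierarchical_predict("a late goal decided the football match", TREE, TRAINING))
```

A flat categorizer under the same assumptions would call `knn_predict` once over all leaf labels, whereas the hierarchical variant makes one decision per level. With a shallow tree such as this one, the two approaches see nearly the same training data, which is consistent with the abstract's remark that shallow hierarchies leave little room for the hierarchical and flat results to differ.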
