TreeBoost.MH: a boosting algorithm for multi-label hierarchical text categorization

Abstract

Hierarchical Text Categorization (HTC) is the task of generating (usually by means of supervised learning algorithms) text classifiers that operate on hierarchically structured classification schemes. Although most large classification schemes for text have a hierarchical structure, text classification researchers have so far focused mostly on algorithms for "flat" classification, i.e. algorithms that operate on non-hierarchical classification schemes. These algorithms, once applied to a hierarchical classification problem, cannot take advantage of the information inherent in the class hierarchy, and may thus be suboptimal in terms of efficiency and/or effectiveness. In this paper we propose TreeBoost.MH, a multi-label HTC algorithm consisting of a hierarchical variant of AdaBoost.MH, a very well-known member of the family of "boosting" learning algorithms. TreeBoost.MH embodies several intuitions that had arisen before within HTC: e.g. the intuitions that both feature selection and the selection of negative training examples should be performed "locally", i.e. by paying attention to the topology of the classification scheme. It also embodies the novel intuition that the weight distribution that boosting algorithms update at every boosting round should likewise be updated "locally". All these intuitions are embodied within TreeBoost.MH in an elegant and simple way, i.e. by defining TreeBoost.MH as a recursive algorithm that uses AdaBoost.MH as its base step, and that recurs over the tree structure. We present the results of experiments with TreeBoost.MH on two HTC benchmarks, and discuss its computational cost analytically.
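The recursive structure described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Node` class, the `tree_train` function, and the `record_training_set` stand-in are hypothetical names introduced here, and the stand-in only records which documents each local learner would see instead of running AdaBoost.MH. The sketch assumes one plausible reading of "local" example selection, namely that the learner at an internal node is trained only on documents that belong to that node, with one binary target per child.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class Node:
    """A category in the classification tree (hypothetical helper type)."""
    name: str
    children: List["Node"] = field(default_factory=list)


def subtree_names(node: Node) -> Set[str]:
    """Return the names of the node and all of its descendants."""
    names = {node.name}
    for child in node.children:
        names |= subtree_names(child)
    return names


def tree_train(node: Node, docs: List[str], labels: Dict[str, Set[str]],
               base_learner: Callable, models: Dict[str, object] = None):
    """Train one local multi-label model per internal node, recursively.

    docs   : ids of the training documents that reach this node
    labels : document id -> set of category names assigned to it
    """
    if models is None:
        models = {}
    if not node.children:          # leaf: nothing to discriminate between
        return models

    # A document is a positive example of a child if any of its labels
    # falls in that child's subtree; otherwise it is a local negative.
    scope = {c.name: subtree_names(c) for c in node.children}
    targets = {d: {c for c, s in scope.items() if labels[d] & s} for d in docs}

    # Base step: in TreeBoost.MH this would be a run of AdaBoost.MH.
    models[node.name] = base_learner(docs, targets)

    # Recur over the tree, passing down only the child's own positives.
    for child in node.children:
        child_docs = [d for d in docs if child.name in targets[d]]
        tree_train(child, child_docs, labels, base_learner, models)
    return models


def record_training_set(docs, targets):
    """Assumed stand-in for AdaBoost.MH: just records the local training set."""
    return sorted(docs)


root = Node("root", [Node("A", [Node("A1"), Node("A2")]), Node("B")])
labels = {"d1": {"A1"}, "d2": {"A2"}, "d3": {"B"}, "d4": {"A1", "B"}}
models = tree_train(root, list(labels), labels, record_training_set)
```

In this toy run the learner at `root` sees all four documents, while the learner at `A` sees only the three documents labelled under `A`. Document `d3` never reaches it, which is the sense in which negative examples are chosen by looking at the topology of the scheme.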