Journal: IEEE Transactions on Knowledge and Data Engineering

Distributional Features for Text Categorization

Abstract

Text categorization is the task of assigning predefined categories to natural language text. With the widely used 'bag of words' representation, previous research usually assigns each word a value that indicates either whether the word appears in the document concerned or how frequently it appears. Although these values are useful for text categorization, they do not fully express the abundant information contained in the document. This paper explores the effect of other types of values, which express the distribution of a word in the document. These novel values assigned to a word are called distributional features; they include the compactness of the appearances of the word and the position of the first appearance of the word. The proposed distributional features are exploited by a tfidf-style equation, and different features are combined using ensemble learning techniques. Experiments show that the distributional features are useful for text categorization. Compared with using the traditional term frequency values alone, including the distributional features requires only a little additional cost, while the categorization performance can be significantly improved. Further analysis shows that the distributional features are especially useful when documents are long and the writing style is casual.
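As a rough illustration of the two kinds of distributional features named in the abstract, the following Python sketch computes, for each word in a tokenized document, a normalized position of first appearance and two simple compactness indicators. The abstract does not give the paper's exact definitions, so the function name `distributional_features`, the `num_parts` parameter, and the specific measures (`spread`, `parts_covered`) are assumptions made for illustration only, not the authors' formulas.

```python
from collections import defaultdict


def distributional_features(tokens, num_parts=10):
    """Illustrative (hypothetical) distributional features per word.

    tokens    -- list of word tokens in document order
    num_parts -- number of equal-sized parts the document is split into
                 (an assumption; the abstract does not specify this)
    """
    n = len(tokens)

    # Record every position at which each word occurs.
    positions = defaultdict(list)
    for i, tok in enumerate(tokens):
        positions[tok].append(i)

    features = {}
    for word, pos in positions.items():
        # Position of the first appearance, normalized to [0, 1).
        first_appearance = pos[0] / n
        # Compactness indicator 1: normalized distance between the
        # first and last appearance (small value = compact).
        spread = (pos[-1] - pos[0]) / n
        # Compactness indicator 2: number of document parts that
        # contain the word at least once.
        parts_covered = len({p * num_parts // n for p in pos})
        features[word] = {
            "first_appearance": first_appearance,
            "spread": spread,
            "parts_covered": parts_covered,
        }
    return features


if __name__ == "__main__":
    doc = "a b a c d e f g h a".split()
    for word, feats in sorted(distributional_features(doc, num_parts=5).items()):
        print(word, feats)
```

Values like these could stand in for, or be combined with, raw term frequency when building a document vector. How the paper weights them in its tfidf-style equation and how it combines them through ensemble learning is not specified in the abstract and is not modeled here.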