Source: International conference on artificial intelligence and soft computing

Feature Extraction in Subject Classification of Text Documents in Polish

Abstract

In this work we evaluate two methods for deriving features for subject classification of text documents. The first method uses the standard Bag-of-Words (BoW) approach, which represents each document as a vector of frequencies of selected terms appearing in it. This method relies heavily on natural language processing (NLP) tools to preprocess the text in a grammar- and inflection-aware way. The second approach is based on the word-embedding technique recently proposed by Mikolov and does not require any NLP preprocessing. In this method, words are represented as vectors in a continuous space, and this representation is used to construct the feature vectors of the documents. We evaluate these fundamentally different approaches on the task of classifying Polish-language Wikipedia articles into 34 subject areas. Our study suggests that the word-embedding-based features seem to outperform the standard NLP-based features, provided that a sufficiently large training dataset is available.
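
The contrast between the two feature types can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the toy tokens, and the two-dimensional embeddings are hypothetical, and averaging word vectors is only one plausible way to build a document vector, since the abstract does not say how the vectors are combined. The BoW part assumes the tokens have already been lemmatized by an NLP tool.

```python
from collections import Counter


def bow_features(tokens, vocabulary):
    """Term-frequency vector over a fixed vocabulary (Bag-of-Words).

    Assumes `tokens` are already lemmatized by an external NLP tool.
    """
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


def mean_embedding_features(tokens, embeddings, dim):
    """Average the vectors of the tokens that have an embedding.

    Averaging is an assumption made for illustration; the abstract
    does not state how word vectors become a document vector.
    """
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        # No known words: fall back to a zero vector.
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# Toy data, invented for illustration only.
tokens = ["kot", "pies", "kot", "dom"]
vocabulary = ["kot", "pies", "ryba"]
embeddings = {"kot": [1.0, 0.0], "pies": [0.0, 1.0]}

bow = bow_features(tokens, vocabulary)
emb = mean_embedding_features(tokens, embeddings, dim=2)
print(bow)
print(emb)
```

Either vector could then be passed to any standard classifier. The BoW vector grows with the vocabulary size, whereas the embedding-based vector has a fixed dimension regardless of how many distinct words occur.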