Feature Extraction in Subject Classification of Text Documents in Polish


Abstract

In this work we evaluate two different methods for deriving features for subject classification of text documents. The first method uses the standard Bag-of-Words (BoW) approach, which represents each document as a vector of frequencies of selected terms appearing in it. This method relies heavily on natural language processing (NLP) tools to preprocess the text in a grammar- and inflection-aware way. The second approach is based on the word-embedding technique recently proposed by Mikolov and does not require any NLP preprocessing. In this method, words are represented as vectors in a continuous space, and this representation is used to construct the feature vectors of the documents. We evaluate these fundamentally different approaches on the task of classifying Polish-language Wikipedia articles into 34 subject areas. Our study suggests that word-embedding-based features outperform the standard NLP-based features, provided that a sufficiently large training dataset is available.
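
The contrast between the two feature types can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy vocabulary, and the two-dimensional toy embeddings are hypothetical, the tokens are assumed to be already lemmatized for the BoW case, and averaging word vectors is only one assumed way of turning word embeddings into a document vector (the abstract does not say which one was used).

```python
from collections import Counter

def bow_vector(tokens, vocabulary):
    """Relative frequency of each selected vocabulary term in one document."""
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero for empty documents
    return [counts[term] / total for term in vocabulary]

def mean_embedding(tokens, embeddings, dim):
    """Average the vectors of the tokens that have an embedding (assumed aggregation)."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]

# Hypothetical toy data: a tiny Polish "document", vocabulary, and embeddings.
doc = "kot pije mleko kot śpi".split()
vocabulary = ["kot", "mleko", "pies"]
toy_embeddings = {"kot": [1.0, 0.0], "mleko": [0.0, 1.0], "pije": [1.0, 1.0]}

bow = bow_vector(doc, vocabulary)                  # one value per vocabulary term
emb = mean_embedding(doc, toy_embeddings, dim=2)   # one value per embedding dimension
print(bow, emb)
```

In the BoW case the vector length equals the size of the selected vocabulary, whereas in the embedding case it equals the embedding dimension, independent of vocabulary size. Either vector could then be passed to any standard classifier.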


Bibliographic details

Source
International Conference on Artificial Intelligence and Soft Computing | 2018 | pp. 445-452 | 8 pages
Authors
Tomasz Walkowiak; Szymon Datko; Henryk Maciejewski
Original format
PDF
Keywords
Text mining; Subject classification; Bag of words; Word embedding; fastText

Similar literature


1. Classification of Protein-Protein Interaction Full-Text Documents Using Text and Citation Network Features [J]. Kolchinsky Artemy, Abi-Haidar Alaa, Kaur Jasleen. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2010, No. 3.
2. Feature Extraction or Feature Selection for Text Classification: A Case Study on Phishing Email Detection [J]. Masoumeh Zareapoor, Seeja K. R. International Journal of Information Engineering and Electronic Business, 2015, No. 2.
3. A Novel Inherent Distinguishing Feature Selector for Highly Skewed Text Document Classification [J]. Muhammad Sajid Ali, Kashif Javed. Arabian Journal for Science and Engineering. Section A, Sciences, 2020, No. 12.
4. Feature Extraction in Subject Classification of Text Documents in Polish [C]. Tomasz Walkowiak, Szymon Datko, Henryk Maciejewski. International Conference on Artificial Intelligence and Soft Computing, 2018.
5. Feature selection and extraction for text classification [D]. Bakus, Jan. 2005.
6. Textractor: a hybrid system for medications and reason for their prescription extraction from clinical text documents [O]. Stéphane M Meystre, Julien Thibault, Shuying Shen. 2010.
7. Document-base Extraction for Single-label Text Classification [O]. Yanbo J. Wang, Robert S, Frans Coenen. 2009.
