Journal: Neural Processing Letters

Combining Embeddings of Input Data for Text Classification

Abstract

Automatic text classification is an essential part of text analysis. Classification can be improved at different stages, such as the preprocessing step or the network implementation. In this paper, we focus on how combining different text encoding methods affects classification accuracy. To this end, we implemented a multi-input neural network that encodes input text using several text encoding techniques, such as BERT, a neural embedding layer, GloVe, skip-thoughts and ParagraphVector. The text can be represented at different levels of tokenisation, such as the sentence level, word level, byte pair encoding level and character level. Experiments were conducted on seven datasets in languages from different language families: English, German, Swedish and Czech. Some of those languages feature agglutination and grammatical cases. Two of the seven datasets originated from real commercial scenarios: (1) classifying ingredients into their corresponding classes using a corpus provided by Northfork; and (2) classifying texts according to the English level of their writers using a corpus provided by ProvenWord. The developed architecture achieves an improvement with different combinations of text encoding techniques, depending on the characteristics of each dataset. Once the best combination of embeddings at different levels was determined, different multi-input neural network architectures were compared. The results obtained with the best embedding combination and the best neural network architecture were compared with state-of-the-art approaches. On the datasets used in the experiments, these results were better than the state-of-the-art baselines.
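
The idea of encoding the same text at several tokenisation levels and feeding the combined representation to a classifier can be illustrated with a minimal sketch. It is not the authors' implementation: the paper uses encoders such as BERT and GloVe inside a multi-input neural network, whereas this sketch swaps in a simple hashing-trick bag-of-tokens vector for each level and joins the levels by plain concatenation. All function names and dimensions (`word_tokens`, `char_tokens`, `hashed_counts`, `combined_encoding`, `word_dim`, `char_dim`) are hypothetical.

```python
import zlib


def word_tokens(text):
    """Word-level tokenisation: lowercase and split on whitespace."""
    return text.lower().split()


def char_tokens(text):
    """Character-level tokenisation: every non-whitespace character."""
    return [c for c in text.lower() if not c.isspace()]


def hashed_counts(tokens, dim):
    """Hashing-trick bag-of-tokens vector, normalised to sum to 1.

    Assumption: this is only a stand-in for a learned embedding.
    """
    vec = [0.0] * dim
    for tok in tokens:
        # crc32 is deterministic across runs, unlike the built-in hash().
        vec[zlib.crc32(tok.encode("utf-8")) % dim] += 1.0
    total = sum(vec)
    return [v / total for v in vec] if total else vec


def combined_encoding(text, word_dim=16, char_dim=8):
    """Encode the text at two levels and concatenate the results."""
    word_vec = hashed_counts(word_tokens(text), word_dim)
    char_vec = hashed_counts(char_tokens(text), char_dim)
    return word_vec + char_vec  # one segment per input branch


features = combined_encoding("whole wheat flour")
print(len(features))  # 24 = word_dim + char_dim
```

In a multi-input network, each segment would instead come from its own input branch, and the concatenated vector would feed the classification layers.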
