Combining Embeddings of Input Data for Text Classification

Parcheta Zuzanna; Sanchis-Trilles German; Casacuberta Francisco; Rendahl Robin


Abstract

Automatic text classification is an essential part of text analysis. Text classification can be improved at different levels, such as the preprocessing step or the network implementation. In this paper, we focus on how combining different text encoding methods may affect classification accuracy. To do this, we implemented a multi-input neural network that is able to encode input text using several text encoding techniques such as BERT, a neural embedding layer, GloVe, skip-thoughts and ParagraphVector. The text can be represented at different levels of tokenised input, such as the sentence level, word level, byte pair encoding level and character level. Experiments were conducted on seven datasets from different language families: English, German, Swedish and Czech. Some of those languages feature agglutination and grammatical cases. Two of the seven datasets originated from real commercial scenarios: (1) classifying ingredients into their corresponding classes by means of a corpus provided by Northfork; and (2) classifying texts according to the English level of their corresponding writers by means of a corpus provided by ProvenWord. The developed architecture achieves an improvement with different combinations of text encoding techniques, depending on the characteristics of each dataset. Once the best combination of embeddings at different levels was determined, different multi-input neural network architectures were compared. The results obtained with the best embedding combination and the best neural network architecture were compared with state-of-the-art approaches. The results obtained with the dataset used in the experiments were better than the state-of-the-art baselines.
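
The idea of a multi-input network that encodes the same text at several tokenisation levels and combines the resulting embeddings can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the architecture from the paper: the class names EmbeddingBranch and MultiInputClassifier are hypothetical, the encoders are untrained random lookup tables standing in for BERT, GloVe and the other techniques named in the abstract, combination is plain concatenation of mean-pooled vectors, and only the word and character levels are shown.

```python
import math
import random


def tokenise_words(text):
    """Word-level view: lowercase whitespace split."""
    return text.lower().split()


def tokenise_chars(text):
    """Character-level view: every non-space character."""
    return [c for c in text.lower() if not c.isspace()]


class EmbeddingBranch:
    """One input branch (hypothetical): lookup table plus mean pooling."""

    def __init__(self, tokeniser, dim, rng):
        self.tokeniser = tokeniser
        self.dim = dim
        self.rng = rng
        self.table = {}  # token -> vector, created lazily and untrained

    def encode(self, text):
        tokens = self.tokeniser(text)
        if not tokens:
            return [0.0] * self.dim
        vectors = []
        for tok in tokens:
            if tok not in self.table:
                self.table[tok] = [
                    self.rng.uniform(-0.1, 0.1) for _ in range(self.dim)
                ]
            vectors.append(self.table[tok])
        # Mean-pool the token vectors into one fixed-size vector.
        return [sum(col) / len(vectors) for col in zip(*vectors)]


class MultiInputClassifier:
    """Concatenates all branch outputs and applies one softmax layer."""

    def __init__(self, branches, n_classes, seed=0):
        rng = random.Random(seed)
        self.branches = branches
        in_dim = sum(b.dim for b in branches)
        self.weights = [
            [rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
            for _ in range(n_classes)
        ]
        self.bias = [0.0] * n_classes

    def features(self, text):
        # Assumed combination strategy: simple concatenation.
        combined = []
        for branch in self.branches:
            combined.extend(branch.encode(text))
        return combined

    def predict_proba(self, text):
        x = self.features(text)
        logits = [
            sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(self.weights, self.bias)
        ]
        peak = max(logits)  # subtract the maximum for numerical stability
        exps = [math.exp(z - peak) for z in logits]
        total = sum(exps)
        return [e / total for e in exps]


rng = random.Random(42)
model = MultiInputClassifier(
    [EmbeddingBranch(tokenise_words, 8, rng),
     EmbeddingBranch(tokenise_chars, 4, rng)],
    n_classes=3,
)
probs = model.predict_proba("whole wheat flour")
```

Adding or removing a branch changes only the list passed to MultiInputClassifier, which is one simple way to compare embedding combinations per dataset. In a real setting the weights would be trained and each branch would wrap an actual encoder.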


Bibliographic details

Source
Neural Processing Letters | 2021, Issue 5 | pp. 3123-3151 | 29 pages
Authors
Parcheta Zuzanna; Sanchis-Trilles German; Casacuberta Francisco; Rendahl Robin
Author affiliations

Sciling SL Carrer del Riu 321 Pinedo 46012 Spain;

Sciling SL Carrer del Riu 321 Pinedo 46012 Spain;

PRHLT Res Ctr Camino Vera S-N Valencia 46022 Spain;

Northfork Regeringsgatan 65 S-11156 Stockholm Sweden;

Indexing information
Original format: PDF
Language: English
Keywords
Text classification; Multi-input network; Agglutinative language; Inflected language; Embedding combination;


