Conference: International Conference on Multimedia Modeling

Semantic and Morphological Information Guided Chinese Text Classification


Abstract

Recently proposed models such as BERT perform well on many text processing tasks. Using deep layers and large amounts of text, they learn context-sensitive features, which provide useful semantics for word sense disambiguation. For Chinese text classification, however, the majority of datasets are crawled from social networking sites and are semantically complex and variable. How much data is needed to pre-train these models so that they grasp semantic features and understand context remains an open question. In this paper, we propose a novel shallow language model that uses sememe information to guide the model toward semantic information without a large amount of pre-training data. We then use the Chinese character representations generated by this model for text classification. Furthermore, to make Chinese as easy to initialize as English, we apply convolutional neural networks over Chinese strokes to obtain a structure-based initialization of Chinese characters for our model. The model is pre-trained on part of the Chinese Wikipedia dataset. Experiments on text classification datasets show that our model outperforms other state-of-the-art models by a large margin. Our model is also more interpretable, owing to the semantic and morphological information it introduces.
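
The abstract does not give the architecture of the stroke-level network, so the following is only a minimal sketch of the general idea: a one-dimensional convolution over a character's stroke sequence, followed by max pooling, yields a fixed-size vector that could serve as a character initialization. The five-class stroke inventory, all dimensions, the random weights, and the names `stroke_embeddings`, `filters`, and `char_vector` are assumptions made for illustration and are not taken from the paper.

```python
import random

# Assumed stroke inventory (five basic classes); the IDs are hypothetical:
# 0 = horizontal, 1 = vertical, 2 = left-falling, 3 = dot/right-falling, 4 = turning
NUM_STROKE_TYPES = 5
EMBED_DIM = 4      # size of each stroke embedding (assumed)
NUM_FILTERS = 6    # dimension of the resulting character vector (assumed)
KERNEL_WIDTH = 3   # consecutive strokes seen by each filter (assumed)

rng = random.Random(0)  # fixed seed so the sketch is reproducible


def rand_matrix(rows, cols):
    """Return a rows x cols matrix of small random weights."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


# Untrained parameters; in a real model these would be learned.
stroke_embeddings = rand_matrix(NUM_STROKE_TYPES, EMBED_DIM)
filters = rand_matrix(NUM_FILTERS, KERNEL_WIDTH * EMBED_DIM)


def char_vector(stroke_ids):
    """Map a stroke-ID sequence to a fixed-size vector via conv + ReLU + max pooling."""
    embedded = [stroke_embeddings[s] for s in stroke_ids]
    # Zero-pad short sequences so that at least one window exists.
    while len(embedded) < KERNEL_WIDTH:
        embedded.append([0.0] * EMBED_DIM)
    # Each window is KERNEL_WIDTH consecutive stroke embeddings, flattened.
    windows = [
        [x for vec in embedded[i:i + KERNEL_WIDTH] for x in vec]
        for i in range(len(embedded) - KERNEL_WIDTH + 1)
    ]
    vector = []
    for f in filters:
        activations = [max(0.0, sum(w * x for w, x in zip(f, win))) for win in windows]
        vector.append(max(activations))  # max pooling over stroke positions
    return vector


# Stroke sequences under the assumed inventory:
# 木 = horizontal, vertical, left-falling, right-falling; 十 = horizontal, vertical
mu = char_vector([0, 1, 2, 3])
shi = char_vector([0, 1])
```

Because the pooling step collapses the stroke axis, characters with different stroke counts map to vectors of the same length (`NUM_FILTERS`), which is what makes the output usable as an embedding initialization.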
