Conference: International Joint Conference on Neural Networks

Subword Semantic Hashing for Intent Classification on Small Datasets

Abstract

In this paper, we introduce the use of Semantic Hashing as an embedding for the task of Intent Classification and achieve state-of-the-art performance on three frequently used benchmarks. Intent Classification on a small dataset is a challenging task for data-hungry, state-of-the-art Deep Learning based systems. Semantic Hashing is an attempt to overcome this challenge and learn robust text classification. Current word embedding based methods [11], [13], [14] depend on vocabularies. A major drawback of such methods is out-of-vocabulary terms, especially when the training dataset is small and the vocabulary in use is wide. This is the case in Intent Classification for chatbots, where datasets are typically small and extracted from internet communication. Two problems arise with the use of internet communication. First, such datasets lack many of the vocabulary terms needed to use word embeddings efficiently. Second, users frequently make spelling errors. Typically, models for intent classification are not trained on spelling errors, and it is difficult to anticipate the ways in which users will make mistakes. Models that depend on a word vocabulary will always face such issues. An ideal classifier should handle spelling errors inherently. With Semantic Hashing, we overcome these challenges and achieve state-of-the-art results on three datasets: Chatbot, Ask Ubuntu, and Web Applications [3]. Our benchmarks are available online.
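
The abstract does not describe how the subword representation is built. The sketch below illustrates the general idea of why subword features tolerate spelling errors better than a word vocabulary. It assumes character trigrams taken over words wrapped in `#` boundary markers; that choice, and the helper names `subword_trigrams` and `jaccard`, are assumptions made for illustration and are not taken from this abstract.

```python
def subword_trigrams(text, n=3):
    """Return character n-grams of each word, with '#' marking word boundaries.

    Assumption: '#' markers and n=3 are illustrative choices only.
    """
    grams = []
    for word in text.lower().split():
        marked = f"#{word}#"
        # Slide a window of width n across the marked word.
        grams.extend(marked[i:i + n] for i in range(len(marked) - n + 1))
    return grams


def jaccard(a, b):
    """Jaccard overlap between two collections of n-grams."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# A word-level vocabulary treats a misspelling as an unknown token.
vocabulary = {"weather", "today"}
typo_in_vocabulary = "wether" in vocabulary

# Subword features still overlap between the correct and misspelled forms.
correct = subword_trigrams("weather")
typo = subword_trigrams("wether")
shared = set(correct) & set(typo)
overlap = jaccard(correct, typo)
```

Here the misspelling `wether` is absent from the word vocabulary, yet it shares the trigrams `#we`, `the`, `her`, and `er#` with `weather`, so a classifier built on such features still receives a usable signal.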
