International Conference on Digital Information, Networking, and Wireless Communications

Text encoding for deep learning neural networks: A reversible base 64 (Tetrasexagesimal) Integer Transformation (RIT64) alternative to one hot encoding with applications to Arabic morphology



Abstract

One Hot Encoding (OHE) is currently the norm in text encoding for deep learning neural models. The main problem with OHE is that the size of the input vector, and hence the number of neurons in the input layer, depends on the size of the vocabulary. Experience has shown that, when OHE is used, the training time for text classification neural models grows exponentially with the size of the vocabulary. For example, if the vocabulary contains 10,000 words, the input vector will have 10,000 components, implying 10,000 neurons in the input layer. This paper proposes and illustrates an alternative Reversible Integer Transformation (RIT) whereby each word in the training/testing set is transformed into a base-64 integer. The transformation is reversible, so the output of the network can easily be converted back to string form without the need for an index. Another important feature is that each character in the word is represented using only six bits, at the appropriate position within the resulting base-64 integer. The maximum number of neurons needed in the input layer is 64, but the actual number depends on the maximum word length in the vocabulary and is usually below 64.
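The six-bits-per-character packing described in the abstract can be sketched as follows in Python. This is only a minimal illustration, not the paper's implementation: the alphabet and character-to-code mapping below are assumptions (the paper targets Arabic morphology and its exact mapping is not reproduced here), and code 0 is reserved so that leading characters survive the round trip.

```python
import string

# Hypothetical 63-symbol alphabet; each symbol gets a 6-bit code in 1..63.
# Code 0 is reserved as "no character" so leading symbols are not lost.
ALPHABET = string.ascii_lowercase + string.ascii_uppercase + string.digits + "_"
CODE = {ch: i + 1 for i, ch in enumerate(ALPHABET)}  # char -> 6-bit code

def encode(word: str) -> int:
    """Pack each character into 6 bits of a single base-64 integer."""
    n = 0
    for ch in word:
        n = (n << 6) | CODE[ch]
    return n

def decode(n: int) -> str:
    """Invert encode() arithmetically, with no vocabulary index needed."""
    chars = []
    while n:
        chars.append(ALPHABET[(n & 0x3F) - 1])  # low 6 bits -> one character
        n >>= 6
    return "".join(reversed(chars))
```

Because decoding is pure bit arithmetic on the integer, the network's numeric output can be mapped back to a string without storing a word index, which is the reversibility property the abstract emphasizes.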

