International Conference on Digital Information, Networking, and Wireless Communications

Text encoding for deep learning neural networks: A reversible base 64 (Tetrasexagesimal) Integer Transformation (RIT64) alternative to one hot encoding with applications to Arabic morphology



Abstract

One Hot Encoding (OHE) is currently the norm in text encoding for deep learning neural models. The main problem with OHE is that the size of the input vector, and hence the number of neurons in the input layer, depends on the size of the vocabulary. Experience has shown that the training time for text classification neural models grows exponentially with the size of the vocabulary when OHE is used. For example, if the vocabulary contains 10,000 words, the input vector will have 10,000 components, implying 10,000 neurons in the input layer. This paper proposes and illustrates an alternative Reversible Integer Transformation (RIT) whereby each word in the training/testing set is transformed into base-64 integer format. The transformation is reversible, and the output of the network can easily be converted back to string format (without the need for an index). Another important feature is that each character in the word is represented using only six bits, placed at the appropriate position in the resulting base-64 integer. The maximum number of neurons needed in the input layer is 64, but the actual number depends on the maximum word length in the vocabulary and is usually below 64.
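The core idea of the transformation can be sketched as follows. This is a minimal illustration, not the paper's exact implementation: the 64-symbol alphabet used here is the standard Base64 character set (the paper's alphabet for Arabic text is not given in the abstract), and the decoder takes the word length as a parameter since symbol code 0 would otherwise be indistinguishable from padding.

```python
# Sketch of a reversible base-64 (6-bits-per-character) integer encoding.
# Assumption: a standard Base64-style alphabet stands in for the paper's
# character set; each character occupies 6 bits at its position in the word.

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
CHAR_TO_CODE = {c: i for i, c in enumerate(ALPHABET)}  # char -> 0..63

def encode(word: str) -> int:
    """Pack each character into 6 bits of a single integer."""
    n = 0
    for i, ch in enumerate(word):
        n |= CHAR_TO_CODE[ch] << (6 * i)  # character i occupies bits 6i..6i+5
    return n

def decode(n: int, length: int) -> str:
    """Invert encode(): extract 6 bits per character position."""
    return "".join(ALPHABET[(n >> (6 * i)) & 0x3F] for i in range(length))

# Round trip: the transformation is reversible without a vocabulary index.
word = "RIT64"
assert decode(encode(word), len(word)) == word
```

Because each character contributes only 6 bits, the integer for a word of length L fits in 6L bits, so the input size is bounded by the maximum word length rather than by the vocabulary size, which is the property the abstract emphasizes.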
