Conference: International Conference on Asian Language Processing

Tibetan Word Segmentation Method Based on BiLSTM_CRF Model


Abstract

Tibetan word segmentation is a key technology for Tibetan speech synthesis and Tibetan speech recognition. Traditional Tibetan word segmentation relied mainly on a combination of rules and statistics. In the deep learning era, models can learn features automatically. This paper proposes a Tibetan word segmentation method based on a bidirectional long short-term memory network with a conditional random field (BiLSTM_CRF). Tibetan sentences are first divided manually into clauses, words, and abbreviated words. Low-frequency words are removed to form a Tibetan dictionary. Text features are then extracted with the dictionary: Word2vec is applied to the corpus to obtain word vectors. The word vectors are fed to the BiLSTM model, and the BiLSTM output is passed as features to the CRF model, which performs four-tag (word-position) labeling to produce the Tibetan word segmentation. Experimental results show that the proposed method segments Tibetan text well: accuracy reaches 94.33%, recall is 93.89%, and the F value is 94.11%.
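The final labeling and scoring steps can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the four tags are the common B/M/E/S word-position scheme (begin, middle, end, single), that the tagging units are syllables, and that the F value is the harmonic mean of the two reported scores; the abstract states none of these explicitly. The function names and the placeholder units `u1` to `u5` are hypothetical.

```python
from typing import List, Set, Tuple


def tags_to_words(units: List[str], tags: List[str]) -> List[List[str]]:
    """Group tagging units into words from B/M/E/S tags (assumed scheme)."""
    words: List[List[str]] = []
    current: List[str] = []
    for unit, tag in zip(units, tags):
        # A new word starts at B or S; flush any unfinished word first.
        if tag in ("B", "S") and current:
            words.append(current)
            current = []
        current.append(unit)
        # A word ends at E or S.
        if tag in ("E", "S"):
            words.append(current)
            current = []
    if current:  # tolerate a sequence that ends mid-word
        words.append(current)
    return words


def word_spans(words: List[List[str]]) -> Set[Tuple[int, int]]:
    """Turn a segmentation into a set of (start, end) unit offsets."""
    spans: Set[Tuple[int, int]] = set()
    start = 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def f_value(p: float, r: float) -> float:
    """Harmonic mean of two scores (assumed definition of the F value)."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


def score(gold: List[List[str]], pred: List[List[str]]) -> Tuple[float, float, float]:
    """Word-level precision, recall and F value of pred against gold."""
    g, p = word_spans(gold), word_spans(pred)
    correct = len(g & p)
    precision = correct / len(p) if p else 0.0
    recall = correct / len(g) if g else 0.0
    return precision, recall, f_value(precision, recall)


# Hypothetical placeholder units standing in for Tibetan syllables.
units = ["u1", "u2", "u3", "u4", "u5"]
gold = tags_to_words(units, ["B", "E", "S", "B", "E"])
pred = tags_to_words(units, ["B", "E", "B", "E", "S"])
print(gold, pred, score(gold, pred))
```

Under the harmonic-mean assumption, `f_value(94.33, 93.89)` rounds to 94.11, which matches the F value reported in the abstract.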
