Tibetan Word Segmentation as Sub-syllable Tagging with Syllable's Part-of-Speech Property

Abstract

When Tibetan word segmentation is treated as a sequence labelling problem, machine learning models such as maximum entropy (ME) and conditional random fields (CRFs) can be used to train the segmenter. The performance of the segmenter depends on many factors. In this paper, three factors are compared: the strategy for abbreviated syllables, the tag set, and the syllable's Part-Of-Speech property. Experimental data show the following. First, if each abbreviated syllable is split into two units for labelling rather than kept as one, the F-measure improves by 0.06% and 0.10% on the 4-tag set and the 6-tag set respectively. Second, if the 6-tag set is used rather than the 4-tag set, the F-measure improves by 0.10% and 0.14% under the two strategies for abbreviated syllables respectively. Third, when the syllable's Part-Of-Speech property is taken into account, the F-measure improves by 0.47% and 0.41% respectively over the other two methods, which do not use it, on the 4-tag set, and by 0.45% and 0.35% on the 6-tag set; these gains are much higher than the former improvements. It is therefore a better choice to exploit the syllable's Part-Of-Speech property information while using the sub-syllable as the tagging unit.
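
To make the sequence labelling formulation concrete, the sketch below converts a segmented sentence into per-unit position tags and back. The abstract does not list the tags in either set, so the inventories here are assumptions borrowed from conventions common in word segmentation work: `B`, `M`, `E`, `S` for the 4-tag set and `B`, `B2`, `B3`, `M`, `E`, `S` for the 6-tag set. The function names and the placeholder units `u1` to `u6` are hypothetical, and the units stand for whatever labelling unit is chosen (a syllable or a sub-syllable).

```python
from typing import List, Tuple

# ASSUMPTION: tag inventories are not given in the abstract; these follow
# common word-segmentation conventions and may differ from the paper's.
FOUR_TAG = "4-tag"   # B, M, E, S
SIX_TAG = "6-tag"    # B, B2, B3, M, E, S


def tag_word(n_units: int, scheme: str = FOUR_TAG) -> List[str]:
    """Return one position tag per labelling unit of a single word."""
    if n_units < 1:
        raise ValueError("a word must contain at least one unit")
    if n_units == 1:
        return ["S"]  # single-unit word
    if scheme == FOUR_TAG:
        return ["B"] + ["M"] * (n_units - 2) + ["E"]
    if scheme == SIX_TAG:
        # The first three non-final units get B, B2, B3; the rest get M.
        head = ["B", "B2", "B3"][: n_units - 1]
        middle = ["M"] * (n_units - 1 - len(head))
        return head + middle + ["E"]
    raise ValueError(f"unknown scheme: {scheme}")


def tag_sentence(words: List[List[str]], scheme: str = FOUR_TAG) -> List[Tuple[str, str]]:
    """Flatten a segmented sentence into (unit, tag) training pairs."""
    pairs = []
    for word in words:
        pairs.extend(zip(word, tag_word(len(word), scheme)))
    return pairs


def decode(units: List[str], tags: List[str]) -> List[List[str]]:
    """Rebuild words from predicted tags: B or S opens a new word."""
    words: List[List[str]] = []
    for unit, tag in zip(units, tags):
        if tag in ("B", "S") or not words:
            words.append([unit])
        else:
            words[-1].append(unit)
    return words


if __name__ == "__main__":
    # Hypothetical sentence: one 5-unit word followed by a 1-unit word.
    sentence = [["u1", "u2", "u3", "u4", "u5"], ["u6"]]
    for scheme in (FOUR_TAG, SIX_TAG):
        print(scheme, tag_sentence(sentence, scheme))
```

Under these assumed inventories the two schemes differ only for words of three or more units, where the 6-tag set distinguishes the second and third positions instead of labelling them `M`. A model such as ME or a CRF would then be trained to predict the tag column from features of each unit.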