
Tagging with Small Training Corpora



Abstract

The analysis of textual data may start by classifying words using a predefined tag set. However, assigning part-of-speech tags to words in unrestricted text (POS tagging) remains an open problem for natural language text understanding. Most current taggers require huge amounts of hand-tagged text for training (on the order of 10^5 pretagged words). Producing this text requires linguistically highly trained manpower for a highly repetitive and boring job, and the results obtained are not of optimal quality. Moreover, the same problem must be faced again whenever one moves to another text genre. Our proposal goes in another direction. By carefully combining a large lexicon with an efficient neural-network-based generator of taggers, we can generate POS taggers using no more than 10^4 hand-corrected tagged words for training, an amount of text that can feasibly be corrected by hand. Experimental results are presented and discussed for the SUSANNE Corpus. Results on three additional Portuguese corpora are also discussed. A precision of 96% is obtained when unknown words occur in the test set, and 98% when every word in the test set is known.
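The two precision figures above distinguish test sets that contain unknown words from test sets where every word is known. As a minimal sketch of how such a split could be measured, assuming that "known" means "present in the lexicon", the following Python function reports per-token precision separately for known and unknown words. The function name, the tuple layout, and the toy tags are illustrative assumptions and are not taken from the paper.

```python
from typing import Dict, Iterable, Optional, Set, Tuple

# Hypothetical token layout: (word, gold_tag, predicted_tag).
Token = Tuple[str, str, str]


def precision_by_lexicon_coverage(
    tokens: Iterable[Token], lexicon: Set[str]
) -> Dict[str, Optional[float]]:
    """Per-token precision for known words, unknown words, and overall.

    A word counts as "known" when its lowercased form is in `lexicon`
    (an assumption made for this sketch). A bucket with no tokens
    yields None instead of a ratio.
    """
    counts = {"known": [0, 0], "unknown": [0, 0]}  # [correct, total]
    for word, gold, predicted in tokens:
        bucket = "known" if word.lower() in lexicon else "unknown"
        counts[bucket][1] += 1
        if gold == predicted:
            counts[bucket][0] += 1

    def ratio(correct: int, total: int) -> Optional[float]:
        return correct / total if total else None

    result = {name: ratio(c, t) for name, (c, t) in counts.items()}
    result["overall"] = ratio(
        sum(c for c, _ in counts.values()),
        sum(t for _, t in counts.values()),
    )
    return result


if __name__ == "__main__":
    toy_lexicon = {"the", "dog", "barks"}
    toy_tokens = [
        ("The", "DET", "DET"),
        ("dog", "NOUN", "NOUN"),
        ("barks", "VERB", "NOUN"),  # tagging error on a known word
        ("zorp", "NOUN", "NOUN"),   # word missing from the lexicon
    ]
    print(precision_by_lexicon_coverage(toy_tokens, toy_lexicon))
```

Reporting the two buckets separately makes it visible how much of the overall error comes from words the lexicon does not cover, which is the comparison the abstract draws between its 96% and 98% figures.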
