Journal: Language Resources and Evaluation

Coupling an annotated corpus and a lexicon for state-of-the-art POS tagging

Abstract

This paper investigates how best to couple hand-annotated data with information extracted from an external lexical resource to improve part-of-speech tagging performance. Focusing mostly on French tagging, we introduce a maximum entropy Markov model-based tagging system that is enriched with information extracted from a morphological resource. This system achieves 97.75% accuracy on the French Treebank, an error reduction of 25% (38% on unknown words) over the same tagger without lexical information. We perform a series of experiments that help explain how this lexical information improves tagging accuracy. We also conduct experiments on datasets and lexicons of varying sizes in order to assess the best trade-off between annotating data and developing a lexicon. We find that the use of a lexicon improves the quality of the tagger at any stage of development of either resource, and that for fixed performance levels the availability of the full lexicon consistently reduces the need for supervised data by at least one half.
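
The reported figures can be related by simple arithmetic. The sketch below is an illustration only: it assumes that "error reduction" means relative reduction in error rate (1 minus accuracy) and that the 25% figure is exact rather than rounded. The function name `implied_baseline_accuracy` is hypothetical and the resulting baseline value is not quoted from the paper.

```python
def implied_baseline_accuracy(accuracy_pct, relative_error_reduction):
    """Back out the baseline accuracy implied by an enriched tagger's accuracy
    and its relative error reduction over that baseline.

    Assumption: relative_error_reduction = 1 - (enriched_error / baseline_error).
    """
    enriched_error = 100.0 - accuracy_pct
    baseline_error = enriched_error / (1.0 - relative_error_reduction)
    return 100.0 - baseline_error


# 97.75% accuracy with a 25% relative error reduction:
# enriched error = 2.25 points, so baseline error = 2.25 / 0.75 = 3.0 points.
baseline = implied_baseline_accuracy(97.75, 0.25)
print(f"Implied baseline accuracy: {baseline:.2f}%")
```

Under these assumptions the tagger without lexical information would sit at roughly 97% accuracy. Because the published reduction is presumably rounded, the true baseline may differ slightly.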