
Syntactic Chunking Across Different Corpora



Abstract

Syntactic chunking has been a well-defined and well-studied task since its introduction in 2000 as the CoNLL shared task. Though further effort has been spent on improving chunking performance, the experimental data has been restricted, with few exceptions, to (part of) the Wall Street Journal data adopted in the shared task. It remains an open question how those successful chunking technologies extend to other data, which may differ in genre/domain and/or amount of annotation. In this paper we first train chunkers with three classifiers on three different data sets and test on four data sets. We also vary the size of the training data systematically to show the data requirements of chunkers. It turns out that there is no significant difference between these state-of-the-art classifiers; that training on plentiful data from the same corpus (Switchboard) yields results comparable to Wall Street Journal chunkers even when the underlying material is spoken; and that the results achieved with a large amount of unmatched training data can be obtained with a very modest amount of matched training data.
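To make the task concrete: syntactic chunking is standardly cast as BIO sequence labeling over (word, POS) pairs, which is how the CoNLL-2000 shared-task data is annotated. Below is a minimal sketch of that framing using the CoNLL-2000 corpus as shipped with NLTK; the unigram-over-POS baseline is purely illustrative and is not one of the three classifiers compared in the paper.

from nltk import UnigramTagger, download
from nltk.chunk import tree2conlltags
from nltk.corpus import conll2000

download('conll2000', quiet=True)

# Each sentence is stored as a chunk tree; tree2conlltags flattens it
# into (word, POS, BIO-chunk-tag) triples, e.g. ('the', 'DT', 'B-NP').
train = [tree2conlltags(t) for t in conll2000.chunked_sents('train.txt')]
test = [tree2conlltags(t) for t in conll2000.chunked_sents('test.txt')]

# Illustrative baseline: predict the most frequent chunk tag per POS tag.
tagger = UnigramTagger(
    [[(pos, chunk) for _, pos, chunk in sent] for sent in train]
)

correct = total = 0
for sent in test:
    pred = tagger.tag([pos for _, pos, _ in sent])
    for (_, _, gold), (_, guess) in zip(sent, pred):
        correct += (gold == guess)
        total += 1
print(f"per-token BIO accuracy: {correct / total:.3f}")

Chunking results are conventionally reported as chunk-level F1 rather than the per-token accuracy printed here; this sketch only shows the data format and the labeling setup.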
