Conference: Annual conference of the International Speech Communication Association

Rethinking The Corpus: Moving towards Dynamic Linguistic Resources


Abstract

The corpus is an invaluable resource in Spoken and Natural Language Processing. Consistent data sets have allowed for empirical evaluation of competing algorithms, and the sharing of high-quality annotated linguistic data has enabled participation and experimentation by a wide range of researchers. However, although these annotations are often called "gold standard", many corpora contain labeling errors and idiosyncrasies. The current view of the corpus as a static resource makes correcting errors and making other modifications prohibitively difficult. In this paper, we advance a view of the corpus as a dynamically changing resource. We highlight the problems of the static view through case studies of the Penn Treebank, Switchboard, Hub-4, and the Boston University Radio News Corpus. We propose the use of version control software as a mechanism to facilitate this dynamic view.
