Published in: Linguistic annotation workshop

STTS 2.0? Improving the Tagset for the Part-of-Speech-Tagging of German Spoken Data

Abstract

Part-of-speech tagging (POS-tagging) of spoken data requires different means of annotation than POS-tagging of written and edited texts. To capture the features of spoken German, a distinct tagset is needed that accounts for the kinds of elements which occur only in speech. To create such a coherent tagset, the most prominent phenomena of spoken language need to be analyzed, especially with respect to how they differ from written language. Initial evaluations have shown that the most prominent cause of errors (over 50%) in the existing automated POS-tagging of transcripts of spoken German with the Stuttgart-Tübingen Tagset (STTS) and the TreeTagger was the inaccurate interpretation of speech particles. One reason for this is that this class of words is virtually absent from the current STTS. This paper proposes a recategorization of the STTS in the field of speech particles based on distributional factors rather than semantics. The ultimate aim is to create a comprehensive reference corpus of spoken German data for the global research community. It is imperative that all phenomena are reliably recorded in future part-of-speech tag labels.
