Conference: 9th International Conference on Language Resources and Evaluation

GenitivDB - a Corpus-Generated Database for German Genitive Classification


Abstract

We present a novel NLP resource for the explanation of linguistic phenomena, built and evaluated by exploring very large annotated language corpora. For the compilation, we use the German Reference Corpus (DeReKo), which contains more than 5 billion word forms and is the largest linguistic resource worldwide for the study of contemporary written German. The result is a comprehensive database of German genitive formations, enriched with a broad range of intra- and extralinguistic metadata. It can be used for the notoriously controversial classification and prediction of genitive endings (short endings, long endings, zero marker). We also evaluate the main factors influencing the use of specific endings. To get a general idea of a factor's influence and its side effects, we calculate chi-square tests and visualize the residuals with an association plot. The results are evaluated against a gold standard by implementing tree-based machine learning algorithms. For the statistical analysis, we applied the supervised Logistic Model Trees (LMT) algorithm, using the WEKA software. We intend to use this gold standard to evaluate GenitivDB, as well as to explore methodologies for a predictive genitive model.
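
As an illustration of the chi-square step mentioned in the abstract, the following is a minimal sketch of how a chi-square statistic and Pearson residuals (the quantities an association plot typically displays) could be computed for a factor-by-ending contingency table. The counts, the factor "syllable count", and the function name `pearson_residuals` are assumptions invented for this example; they are not taken from GenitivDB, and this is not the authors' implementation.

```python
from math import sqrt

# Hypothetical counts, invented purely for illustration (NOT GenitivDB data).
# Rows: levels of a made-up factor; columns: genitive ending types.
ENDINGS = ["short (-s)", "long (-es)", "zero"]
observed = {
    "monosyllabic": [120, 310, 10],
    "polysyllabic": [480, 60, 20],
}

def pearson_residuals(table):
    """Return (chi2, residuals) for a dict mapping row label -> list of counts."""
    rows = list(table)
    n_cols = len(next(iter(table.values())))
    row_totals = {r: sum(table[r]) for r in rows}
    col_totals = [sum(table[r][j] for r in rows) for j in range(n_cols)]
    grand_total = sum(row_totals.values())

    residuals = {}
    chi2 = 0.0
    for r in rows:
        residuals[r] = []
        for j in range(n_cols):
            # Expected count under independence of factor and ending.
            expected = row_totals[r] * col_totals[j] / grand_total
            # Pearson residual: signed, standardized deviation from expectation.
            res = (table[r][j] - expected) / sqrt(expected)
            residuals[r].append(res)
            chi2 += res ** 2  # chi-square is the sum of squared residuals
    return chi2, residuals

chi2, residuals = pearson_residuals(observed)
dof = (len(observed) - 1) * (len(ENDINGS) - 1)
print(f"chi2 = {chi2:.1f}, dof = {dof}")
for row, values in residuals.items():
    print(row, [f"{e}: {v:+.1f}" for e, v in zip(ENDINGS, values)])
```

A positive residual marks a cell that is more frequent than independence would predict, and a negative one marks a cell that is less frequent; an association plot draws these signed values as bars above or below a baseline.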
