Foreign-language conference: SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology

A Little Linguistics Goes a Long Way: Unsupervised Segmentation with Limited Language Specific Guidance

Abstract

We present de-lexical segmentation, a linguistically motivated alternative to greedy or other unsupervised methods that requires language-specific knowledge but no direct supervision. Our technique involves creating a small grammar of closed-class affixes, which can be written in a few hours. The grammar overgenerates analyses for word forms attested in a raw corpus, and these analyses are then disambiguated based on features of the linguistic base proposed for each form. Extending the grammar to cover orthographic, morpho-syntactic, or lexical variation is simple, making it an ideal solution for challenging corpora with noisy, dialect-inconsistent, or otherwise non-standard content. We demonstrate the utility of de-lexical segmentation on several dialects of Arabic. We consistently outperform competitive unsupervised baselines and approach the performance of state-of-the-art supervised models trained on large amounts of data, providing evidence for the value of linguistic input during preprocessing.
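
The overgenerate-then-disambiguate idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the affix lists are hypothetical transliterated placeholders, the function names `overgenerate` and `segment_corpus` are invented for illustration, and the disambiguation feature used here (how many distinct word types propose the same base) is an assumed stand-in for the base features the paper refers to.

```python
from collections import Counter

# Hypothetical closed-class affixes (transliterated, illustrative only;
# not the grammar used in the paper).
PREFIXES = ["", "w", "b", "al", "wal"]
SUFFIXES = ["", "h", "ha", "hm"]


def overgenerate(word, min_base=2):
    """Return every (prefix, base, suffix) split licensed by the affix lists.

    The unsegmented analysis ("", word, "") is always included, so every
    word has at least one candidate.
    """
    analyses = []
    for p in PREFIXES:
        for s in SUFFIXES:
            # Affixed analyses must leave a base of at least min_base characters.
            if (p or s) and len(word) - len(p) - len(s) < min_base:
                continue
            if word.startswith(p) and word.endswith(s):
                base = word[len(p):len(word) - len(s)]
                analyses.append((p, base, s))
    return analyses


def segment_corpus(words):
    """Choose one analysis per word type using a simple base feature.

    Assumed feature: the number of distinct word types whose candidate
    analyses propose a given base. Ties go to the longer base.
    """
    support = Counter()
    candidates = {}
    for w in set(words):
        candidates[w] = overgenerate(w)
        for _, base, _ in candidates[w]:
            support[base] += 1
    return {
        w: max(c, key=lambda a: (support[a[1]], len(a[1])))
        for w, c in candidates.items()
    }


corpus = ["ktab", "alktab", "wktab", "ktabh", "walktab"]
segmentation = segment_corpus(corpus)
```

In this toy corpus every word type proposes the base `ktab`, so the affixed analyses win over the unsegmented ones. A realistic system would need a richer grammar and more informative base features than raw type support.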