Journal of Language Modelling

Design and analysis of a lean interface for Sanskrit corpus annotation

Abstract

We describe an innovative computer interface designed to assist annotators in the efficient selection of segmentation solutions for proper tagging of Sanskrit corpora. The proposed solution uses a compact representation of the shared forest of all segmentations. The main idea is to represent the union of all segmentations, abstracting from the sandhi rules used, and aligning with the input sentence. We show that this representation provides an exponential saving, in both space and time.

The segmentation methodology is lexicon-directed. When the lexicon does not have full coverage of the corpus vocabulary, some chunks of the input may fail to be recognized. We designed a lexicon-acquisition facility, which remedies this incompleteness and makes the interface more robust.

This interface has been implemented, and is currently being applied to the annotation of the Sanskrit Library corpus. Evaluation over 1,500 sentences from the Pañcatantra text shows the effectiveness of the proposed interface on real corpus data.
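The saving claimed for the shared representation can be illustrated with a small sketch. It is not the authors' implementation: it ignores sandhi entirely (words are matched as plain substrings), the lexicon is a toy set, and the function name `shared_segments` is hypothetical. It only shows that the set of aligned segments (start position, word) taking part in at least one complete segmentation can stay small while the number of complete segmentations grows exponentially.

```python
def shared_segments(text, lexicon):
    """Toy illustration (assumption: no sandhi, plain substring matching).

    Returns (segments, count), where `segments` is the sorted list of
    aligned (start, word) pairs that occur in at least one complete
    segmentation of `text`, and `count` is the number of complete
    segmentations.
    """
    words = [w for w in lexicon if w]  # ignore empty entries
    n = len(text)

    # ways_from[i]: number of ways to segment the suffix text[i:]
    ways_from = [0] * (n + 1)
    ways_from[n] = 1
    for i in range(n - 1, -1, -1):
        for w in words:
            if text.startswith(w, i):
                ways_from[i] += ways_from[i + len(w)]

    # ways_to[i]: number of ways to segment the prefix text[:i]
    ways_to = [0] * (n + 1)
    ways_to[0] = 1
    for i in range(n):
        if ways_to[i]:
            for w in words:
                if text.startswith(w, i):
                    ways_to[i + len(w)] += ways_to[i]

    # Keep a segment only if it is reachable from the start of the input
    # and can still be completed up to the end of the input.
    segments = sorted(
        (i, w)
        for i in range(n)
        for w in words
        if text.startswith(w, i) and ways_to[i] and ways_from[i + len(w)]
    )
    return segments, ways_from[0]


if __name__ == "__main__":
    segs, count = shared_segments("ab" * 10, {"a", "b", "ab"})
    print(len(segs), "aligned segments cover", count, "segmentations")
```

For the 20-character input used above, 30 aligned segments stand for 1,024 distinct segmentations. An interface in this spirit could display the segments once, aligned under the input, and let the annotator accept or reject individual segments instead of browsing every segmentation.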