Conference: Conference on Empirical Methods in Natural Language Processing; International Workshop on Health Text Mining and Information Analysis

Identification of Parallel Sentences in Comparable Monolingual Corpora from Different Registers


Abstract

Parallel aligned sentences provide useful information for various NLP applications. Yet this kind of data is seldom available, especially for languages other than English. We propose to exploit comparable corpora in French that differ in register (specialized and simplified versions) to detect and align parallel sentences. These corpora come from the biomedical domain. Our goal is to decide whether a given pair of specialized and simplified sentences should be aligned. Manually created reference data show an inter-annotator agreement of 0.76. We exploit a set of features and several automatic classifiers. The automatic alignment reaches up to 0.93 Precision, Recall and F-measure. To better evaluate the method, we also apply it to English data from the SemEval STS competitions. The same features and models are applied in monolingual and cross-lingual contexts, where they reach up to 0.90 and 0.73 F-measure, respectively.
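The task described in the abstract, deciding for each sentence pair whether it should be aligned and scoring those decisions with Precision, Recall and F-measure, can be sketched as follows. This is only an illustrative sketch: the abstract does not name the features or classifiers used, so the word-overlap feature (`jaccard`), the threshold rule (`predict_aligned`) and its default value of 0.3 are assumptions introduced here, not the authors' method.

```python
from typing import List, Tuple


def jaccard(a: str, b: str) -> float:
    """Word-overlap (Jaccard) similarity between two sentences.

    Hypothetical stand-in for the paper's feature set, which the
    abstract does not specify.
    """
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def predict_aligned(pair: Tuple[str, str], threshold: float = 0.3) -> bool:
    """Toy decision rule (assumption): align when overlap reaches the threshold."""
    return jaccard(pair[0], pair[1]) >= threshold


def precision_recall_f1(gold: List[bool], pred: List[bool]) -> Tuple[float, float, float]:
    """Precision, recall and F-measure for the positive ("aligned") class."""
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Invented example pairs (specialized, simplified); not data from the paper.
    pairs = [
        ("the patient has high blood pressure", "the patient has hypertension"),
        ("take the tablet twice a day", "store the box in a dry place"),
    ]
    gold = [True, False]
    pred = [predict_aligned(p) for p in pairs]
    print(precision_recall_f1(gold, pred))
```

In practice the single threshold would be replaced by a trained classifier over several features, as the abstract indicates; the evaluation function stays the same.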
