Crosslingual implementation of linguistic taggers using parallel corpora.

Abstract

This thesis addresses the problem of creating linguistic taggers for resource-poor languages using existing taggers for resource-rich languages. Linguistic taggers are classifiers that map individual words or phrases in a sentence to a set of tags. Part-of-speech tagging and named entity extraction are two examples of linguistic tagging. Linguistic taggers are usually trained using supervised learning algorithms. This requires labeled training data, which is not available for many languages.

A parallel corpus of the source and target languages might not be readily available for many language pairs. To deal with this problem, we describe a system for automatic acquisition of aligned, bilingual corpora from pre-specified domains on the World Wide Web. The system involves automatic indexing of a given domain using a web crawler, identifying pairs of pages that are translations of one another, and aligning bilingual texts at the sentence level. Using this approach, we create a 40,000,000-word English-French parallel corpus from the Government of Canada domain. The quality of this corpus is evaluated and compared to other parallel corpora.

We describe an approach for assigning linguistic tags to sentences in a target (resource-poor) language by exploiting a linguistic tagger that has been configured for a source (resource-rich) language. The approach does not require that the input sentence be translated into the source language. Instead, projection of linguistic tags is accomplished through the use of a parallel corpus, which is a collection of texts that are available in both a source language and a target language. The correspondence between words of the source and target languages allows us to project tags from source-language words to target-language words. The projected tags are further processed to compute the final tags of the target-language words. A system for part-of-speech (POS) tagging of French sentences using an English POS tagger and an English/French parallel corpus has been implemented and evaluated using this approach.
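
The tag-projection idea described above can be illustrated with a minimal Python sketch. It is not the thesis implementation: the abstract does not say how projected tags are post-processed, so the majority vote per word type used here is an assumption, as are the function names (`project_tags`, `build_tag_lexicon`, `tag_sentence`), the toy corpus, the tag names, and the fallback tag. Word alignments are assumed to be given as `(source_index, target_index)` pairs produced by some external aligner.

```python
from collections import Counter, defaultdict

def project_tags(source_tags, alignment, target_len):
    """Copy source-side tags onto target positions via word alignment.

    alignment is a list of (source_index, target_index) pairs.
    Unaligned target words keep the value None.
    """
    projected = [None] * target_len
    for src_idx, tgt_idx in alignment:
        if projected[tgt_idx] is None:  # keep the first aligned tag
            projected[tgt_idx] = source_tags[src_idx]
    return projected

def build_tag_lexicon(corpus):
    """Assumed post-processing step: majority vote over projected tags.

    corpus yields (target_words, source_tags, alignment) triples.
    """
    counts = defaultdict(Counter)
    for target_words, source_tags, alignment in corpus:
        projected = project_tags(source_tags, alignment, len(target_words))
        for word, tag in zip(target_words, projected):
            if tag is not None:
                counts[word.lower()][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

def tag_sentence(words, lexicon, default="NOUN"):
    """Tag a new target sentence without translating it.

    Unknown words receive a hypothetical fallback tag.
    """
    return [lexicon.get(w.lower(), default) for w in words]

# Toy English/French data: French words, English tags, alignment pairs.
toy_corpus = [
    (["le", "chat", "dort"], ["DET", "NOUN", "VERB"], [(0, 0), (1, 1), (2, 2)]),
    (["le", "chien", "mange"], ["DET", "NOUN", "VERB"], [(0, 0), (1, 1), (2, 2)]),
    # "the blue house" -> "la maison bleue": adjective and noun swap places.
    (["la", "maison", "bleue"], ["DET", "ADJ", "NOUN"], [(0, 0), (1, 2), (2, 1)]),
]

lexicon = build_tag_lexicon(toy_corpus)
print(tag_sentence(["Le", "chien", "dort"], lexicon))
```

Once the lexicon is built from the parallel corpus, new French sentences are tagged directly, which matches the abstract's point that the input sentence does not need to be translated. A context-free lexicon cannot disambiguate words that take several tags, so a realistic system would likely train a sequence model on the projected tags instead.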

Bibliographic record

  • Author

    Safadi, Hani

  • Author affiliation

    McGill University (Canada)

  • Degree-granting institution: McGill University (Canada)
  • Subjects: Computer science; Electrical engineering; Artificial intelligence
  • Degree: M.Sc.
  • Year: 2008
  • Pages: 62 p.
  • Total pages: 62
  • Format of original: PDF
  • Language of text: English
