Journal: Journal of the American Society for Information Science and Technology

Automatic Construction of English/Chinese Parallel Corpora



Abstract

As the demand for global information grows, multilingual corpora have become a valuable linguistic resource for cross-lingual information retrieval and natural language processing. Dictionaries are the most typical tools for crossing the boundaries between languages. However, general-purpose dictionaries are less sensitive to both genre and domain, and it is impractical to manually construct tailored bilingual dictionaries or sophisticated multilingual thesauri for large applications. Corpus-based approaches, which do not share the limitations of dictionaries, provide a statistical translation model with which to cross the language boundary. Many domain-specific parallel or comparable corpora are employed in machine translation and cross-lingual information retrieval. Most of these pair Indo-European languages, such as English/French and English/Spanish. Asian/Indo-European corpora, especially English/Chinese corpora, are relatively scarce. The objective of the present research is to construct an English/Chinese parallel corpus automatically from the World Wide Web. This paper presents an alignment method, based on dynamic programming, that identifies one-to-one Chinese and English title pairs. The method includes alignment at the title level, word level, and character level. The longest common subsequence (LCS) is applied to find the most reliable Chinese translation of an English word. Because one word in one language may be translated repeatedly into two or more words in the other language, the edit operation of deletion is used to resolve redundancy. A score function is then proposed to determine the optimal title pairs. Experiments were conducted to evaluate the proposed method, using the daily press release articles of the Hong Kong SAR government as the test bed; the precision is 0.998 and the recall is 0.806. The method was also tested on the release and speech articles published by the Hongkong & Shanghai Banking Corporation Limited, where the precision is 1.00 and the recall is 0.948.
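
The abstract names the longest common subsequence (LCS) as the building block for matching an English word's translation against a Chinese title, but gives no implementation details. The following is a minimal sketch of the standard dynamic-programming LCS only, not the authors' alignment method or score function. The function name `lcs` and the example strings are assumptions chosen for illustration.

```python
def lcs(a, b):
    """Return one longest common subsequence of sequences a and b as a list."""
    m, n = len(a), len(b)
    # table[i][j] holds the LCS length of the prefixes a[:i] and b[:j].
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])

    # Walk back from the bottom-right corner to recover one subsequence.
    result = []
    i, j = m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            result.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return result[::-1]


# Hypothetical character-level comparison: a candidate translation
# against a longer Chinese title fragment.
candidate = "香港政府"
title = "香港特别行政区政府"
print("".join(lcs(candidate, title)))
```

In this hypothetical case every character of the candidate appears in order in the title, so the LCS equals the candidate itself. How the paper turns such matches into a title-pair score is not stated in the abstract and is not modeled here.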
