Conference: International Symposium on Methodologies for Intelligent Systems

Harvesting Comparable Corpora and Mining Them for Equivalent Bilingual Sentences Using Statistical Classification and Analogy-Based Heuristics

Abstract

Parallel sentences are a relatively scarce but extremely useful resource for many applications, including cross-lingual retrieval and statistical machine translation. This research explores new methodologies for mining such data from previously obtained comparable corpora. The task is highly practical because non-parallel multilingual data exist in far greater quantities than parallel corpora, yet parallel sentences are the much more useful resource. We propose a web crawling method for building subject-aligned comparable corpora from sources such as Wikipedia dumps and Euronews web pages. Improvements in machine translation are shown for the Polish-English language pair across various text domains. We also tested another method of building parallel corpora from comparable corpora data. It automatically broadens an existing corpus of sentences from subject-aligned corpora based on analogies between them.
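
The idea of mining sentence pairs from subject-aligned documents can be illustrated with a minimal sketch. This is not the authors' classifier or their analogy-based heuristic: it swaps in a simple lexicon-overlap score as a stand-in, and the toy Polish-to-English lexicon, the function names, and the 0.6 threshold are all assumptions made for illustration.

```python
# Hypothetical toy Polish-to-English lexicon (an assumption for illustration);
# a real system would use a large dictionary or a trained classifier.
LEXICON = {
    "kot": {"cat"},
    "pies": {"dog"},
    "śpi": {"sleeps"},
    "na": {"on"},
    "kanapie": {"sofa", "couch"},
    "biega": {"runs"},
}

PUNCTUATION = ".,;:!?\"'"


def tokenize(sentence):
    """Lowercase, split on whitespace, and strip basic punctuation."""
    tokens = (tok.strip(PUNCTUATION) for tok in sentence.lower().split())
    return [tok for tok in tokens if tok]


def overlap_score(src_sentence, tgt_sentence, lexicon=LEXICON):
    """Fraction of source tokens with a lexicon translation in the target.

    Dividing by the longer sentence length penalizes length mismatches.
    """
    src = tokenize(src_sentence)
    tgt = tokenize(tgt_sentence)
    if not src or not tgt:
        return 0.0
    tgt_set = set(tgt)
    covered = sum(1 for tok in src if lexicon.get(tok, set()) & tgt_set)
    return covered / max(len(src), len(tgt))


def mine_pairs(src_sentences, tgt_sentences, threshold=0.6):
    """For each source sentence keep its best-scoring target, if good enough."""
    if not tgt_sentences:
        return []
    pairs = []
    for src in src_sentences:
        best = max(tgt_sentences, key=lambda tgt: overlap_score(src, tgt))
        score = overlap_score(src, best)
        if score >= threshold:
            pairs.append((src, best, score))
    return pairs


if __name__ == "__main__":
    polish = ["Kot śpi na kanapie.", "Pies biega."]
    english = [
        "The dog runs.",
        "The cat sleeps on the sofa.",
        "Stocks fell sharply today.",
    ]
    for src, tgt, score in mine_pairs(polish, english):
        print(f"{score:.2f}\t{src}\t{tgt}")
```

A threshold-based filter like this trades recall for precision: raising the threshold keeps fewer but cleaner pairs, which is the usual concern when the mined sentences are meant as machine translation training data.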
