Harvesting Comparable Corpora and Mining Them for Equivalent Bilingual Sentences Using Statistical Classification and Analogy-Based Heuristics

Abstract

Parallel sentences are a relatively scarce but extremely useful resource for many applications, including cross-lingual retrieval and statistical machine translation. This research explores new methodologies for mining such data from previously obtained comparable corpora. The task is highly practical because non-parallel multilingual data exist in far greater quantities than parallel corpora, yet parallel sentences are a much more useful resource. We propose a web crawling method for building subject-aligned comparable corpora from sources such as Wikipedia dumps and Euronews web pages. The resulting improvements in machine translation are shown for the Polish-English language pair across various text domains. We also tested another method of building parallel corpora from comparable corpora data. It automatically broadens an existing corpus of sentences on the corpora's subject, based on analogies between them.

Bibliographic details

Source
International Symposium on Methodologies for Intelligent Systems | 2015 | 9 pages
Authors
Krzysztof Wolk; Emilia Rejmund; Krzysztof Marasek
Original format: PDF
Chinese Library Classification: TP18-53

Similar literature

1. Extracting Translation Equivalents from Bilingual Comparable Corpora [J]. Hiroyuki Kaji. IEICE Transactions on Information and Systems, 2005, no. 2.
2. Extraction of Bilingual Dictionary from Comparable Corpora for Resource Scarce Languages [J]. Journal of Computational and Theoretical Nanoscience, 2020, no. 1.
3. Exploiting Unbalanced Specialized Comparable Corpora for Bilingual Lexicon Extraction [J]. Emmanuel Morin, Amir Hazem. Natural Language Engineering, 2016, no. pta4.
4. Harvesting Comparable Corpora and Mining Them for Equivalent Bilingual Sentences Using Statistical Classification and Analogy-Based Heuristics [C]. Krzysztof Wolk, Emilia Rejmund, Krzysztof Marasek. International Symposium on Methodologies for Intelligent Systems, 2015.
5. Parallel Sentence Detection in Comparable Corpora with Bilingual Word Embeddings for Low-Resource Languages [D]. John Cadigan. 2018.
6. Bilingual Term Alignment from Comparable Corpora in English Discharge Summary and Chinese Discharge Summary [O]. Yan Xu, Luoxin Chen, Junsheng Wei. 2015.
7. Harvesting Comparable Corpora and Mining Them for Equivalent Bilingual Sentences Using Statistical Classification and Analogy-Based Heuristics [O]. Krzysztof Wołk, Emilia Rejmund, Krzysztof Marasek. 2015.
