A Generative Model for Extracting Parallel Fragments from Comparable Documents

Abstract

Although parallel corpora are essential language resources for many NLP tasks, they are scarce or entirely unavailable for many language pairs. Comparable corpora, by contrast, are widely available and contain parallel fragments of information that can be used in applications such as statistical machine translation (SMT). In this research, we propose a generative LDA-based model for extracting parallel fragments from comparable documents without using any initial parallel data or bilingual lexicon. The experimental results show a significant improvement in an SMT task when the sentence fragments extracted by the proposed method are used in addition to an existing parallel corpus. According to human judgment, the accuracy of the proposed method on an English-Persian task is about 66%. In addition, the out-of-vocabulary (OOV) rate for the same task is reduced by 28%.
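As a rough illustration of the OOV figure mentioned in the abstract, the sketch below computes a token-level OOV rate before and after adding extracted fragments to a training corpus. It is not the authors' evaluation code: the function name `oov_rate`, the whitespace tokenization, and the toy sentences are all assumptions made for illustration, and the abstract does not say whether the 28% is a relative or an absolute reduction (the sketch computes a relative one).

```python
def oov_rate(train_sentences, test_sentences):
    """Fraction of test tokens whose word never appears in the training data.

    Hypothetical helper: assumes whitespace tokenization and a token-level
    (not type-level) definition of the OOV rate.
    """
    vocab = {tok for sent in train_sentences for tok in sent.split()}
    test_tokens = [tok for sent in test_sentences for tok in sent.split()]
    if not test_tokens:
        return 0.0
    unseen = sum(1 for tok in test_tokens if tok not in vocab)
    return unseen / len(test_tokens)


# Toy, made-up data standing in for one side of the corpora.
base_corpus = ["the cat sat", "a dog ran"]
extracted_fragments = ["the bird sang"]
test_set = ["the bird sat", "a fish ran"]

before = oov_rate(base_corpus, test_set)
after = oov_rate(base_corpus + extracted_fragments, test_set)

# Relative reduction in OOV rate obtained by adding the extracted fragments.
relative_reduction = (before - after) / before
print(f"OOV before: {before:.3f}, after: {after:.3f}, "
      f"relative reduction: {relative_reduction:.1%}")
```

In this toy setting the extracted fragment covers one of the two unseen test words, so the OOV rate is halved. On real data the same calculation would be run on the source side of the SMT test set.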

Bibliographic details

Source
Workshop on building and using comparable corpora | 2015 | 9 pages
Authors
Somayeh Bakhshaei; Shahram Khadivi; Reza Safabakhsh
Original format: PDF
Chinese Library Classification: Computer software

Similar literature

1. Extracting parallel fragments from comparable documents using a generative model [J]. Bakhshaei Somayeh, Safabakhsh Reza, Khadivi Shahram. Computer Speech and Language. 2019, issue JANa

2. Discriminative learning of generative models: large margin multinomial mixture models for document classification [J]. Jiang Hui, Pan Zhenyu, Hu Pingzhao. Pattern Analysis and Applications. 2015, issue 3

3. Chinese-Khmer Parallel Fragments Extraction from Comparable Corpus Based on Dirichlet Process [J]. Shan Ning, Xin Yan, Yu Nuo. Procedia Computer Science. 2020, issue 5

4. A Generative Model for Extracting Parallel Fragments from Comparable Documents [C]. Somayeh Bakhshaei, Shahram Khadivi, Reza Safabakhsh. Workshop on building and using comparable corpora. 2015

5. Parallel Sentence Detection in Comparable Corpora with Bilingual Word Embeddings for Low-Resource Languages [D]. Cadigan, John. 2018

6. A toxicology suite adapted for comparing parallel toxicity responses of model human lung cells to diesel exhaust particles and their extracts [O]. Jane Turner, Mark Hernandez, John E. Snawder

7. A Generative Model for Extracting Parallel Fragments from Comparable Documents [O]. Somayeh Bakhshaei, Shahram Khadivi, Reza Safabakhsh. 2015
