Journal: ACM Transactions on Asian Language Information Processing

A Generalized Constraint Approach to Bilingual Dictionary Induction for Low-Resource Language Families


Abstract

The lack or absence of parallel and comparable corpora makes bilingual lexicon extraction a difficult task for low-resource languages. The pivot language and cognate recognition approaches have proven useful for inducing bilingual lexicons for such languages. We propose constraint-based bilingual lexicon induction for closely related languages by extending constraints from the recent pivot-based induction technique and further enabling multiple symmetry assumption cycles to reach many more cognates in the transgraph. We further identify cognate synonyms to obtain many-to-many translation pairs. This article utilizes four datasets: one Austronesian low-resource language and three Indo-European high-resource languages. As baselines, we use three constraint-based methods from our previous work, the Inverse Consultation method, and translation pairs generated from the Cartesian product of input dictionaries. We evaluate our results using precision, recall, and F-score. Our customizable approach allows the user to conduct cross-validation to predict the optimal hyperparameters (cognate threshold and cognate synonym threshold) with various combinations of heuristics and numbers of symmetry assumption cycles to gain the highest F-score. Our proposed methods achieve statistically significant improvements in precision and F-score compared to our previous constraint-based methods. The results show that our method has the potential to complement other bilingual dictionary creation methods, such as word alignment models using parallel corpora for high-resource languages, while also handling low-resource languages well.
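The Cartesian-product baseline and the evaluation metrics named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes one plausible reading of the baseline (every source word is paired with every target word that shares a pivot word), and the function names, the placeholder tokens (`a1`, `p1`, `b1`, ...), and the gold set are all hypothetical toy values, not data from the article.

```python
from itertools import product


def cartesian_candidates(src_to_pivot, pivot_to_tgt):
    """Pair every source word with every target word that shares a pivot word.

    Assumed reading of the "Cartesian product of input dictionaries" baseline;
    the article may define it differently.
    """
    # Invert the source->pivot dictionary so words can be grouped by pivot.
    pivot_to_src = {}
    for src, pivots in src_to_pivot.items():
        for pivot in pivots:
            pivot_to_src.setdefault(pivot, set()).add(src)

    pairs = set()
    for pivot, sources in pivot_to_src.items():
        targets = pivot_to_tgt.get(pivot, ())
        pairs.update(product(sources, targets))
    return pairs


def precision_recall_f1(predicted, gold):
    """Standard precision, recall, and balanced F-score over sets of pairs."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical toy dictionaries: language A -> pivot, pivot -> language B.
src_to_pivot = {"a1": {"p1"}, "a2": {"p1", "p2"}}
pivot_to_tgt = {"p1": {"b1"}, "p2": {"b2", "b3"}}

# Hypothetical gold-standard translation pairs used only for this illustration.
gold = {("a1", "b1"), ("a2", "b2"), ("a2", "b4")}

candidates = cartesian_candidates(src_to_pivot, pivot_to_tgt)
p, r, f = precision_recall_f1(candidates, gold)
print(sorted(candidates))
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

On this toy input, the baseline proposes four candidate pairs, two of which appear in the gold set. A baseline of this kind tends to over-generate, which is the gap that constraint-based filtering is meant to narrow.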
