Conference: International Conference on Applications of Natural Language to Information Systems

Graph-Based Bilingual Sentence Alignment from Large Scale Web Pages

Abstract

Sentence alignment is an enabling technology for automatically extracting large bilingual corpora from the vast and ever-growing Web. In this paper, we propose a novel graph-based sentence alignment approach. Compared with existing approaches, ours is more resistant to the noise and structural diversity of Web pages because it takes sentence structural features into account. We formulate sentence alignment as a matching problem between the nodes (bilingual sentences) of a bipartite graph. The maximum-weighted bipartite graph matching algorithm is first applied to sentence alignment to obtain a globally optimal matching. In addition, sentence merging handles many-to-many matches, and aligned sentence pattern detection handles aligned sentences that receive a low probability because they share few mutually translated words. We achieve precision over 85% and recall over 80% on manually annotated data, and extract 1 million aligned sentence pairs with over 82% accuracy from 0.8 million bilingual pages.
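
The bipartite formulation in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the edge weight used here (a count of source words with a dictionary translation in the target sentence), the toy English-French `lexicon`, and the function names `translation_overlap` and `max_weight_matching` are all assumptions made for illustration, and the structural features, sentence merging, and pattern detection described in the abstract are left out. The matching itself uses the standard Hungarian algorithm on a zero-padded square matrix.

```python
from typing import Dict, List, Set, Tuple


def translation_overlap(src: str, tgt: str, lexicon: Dict[str, Set[str]]) -> int:
    """Hypothetical edge weight: source words with a lexicon translation in tgt."""
    tgt_words = set(tgt.lower().split())
    return sum(1 for w in src.lower().split() if lexicon.get(w, set()) & tgt_words)


def max_weight_matching(weights: List[List[int]]) -> List[Tuple[int, int]]:
    """Maximum-weight bipartite matching via the Hungarian algorithm, O(n^3)."""
    if not weights or not weights[0]:
        return []
    n_rows, n_cols = len(weights), len(weights[0])
    n = max(n_rows, n_cols)
    # Pad to a square matrix with zero-weight edges; negate so that
    # minimising cost is the same as maximising total weight.
    cost = [[-weights[i][j] if i < n_rows and j < n_cols else 0
             for j in range(n)] for i in range(n)]
    inf = float("inf")
    u = [0] * (n + 1)      # row potentials (1-based)
    v = [0] * (n + 1)      # column potentials (1-based)
    p = [0] * (n + 1)      # p[j] = row matched to column j, 0 = unmatched
    way = [0] * (n + 1)    # previous column on the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (n + 1)
        used = [False] * (n + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, n + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(n + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        # Flip the matched edges along the augmenting path.
        while j0 != 0:
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    # Keep only real (non-padding) pairs that share at least one translation.
    pairs = [(p[j] - 1, j - 1) for j in range(1, n + 1)
             if p[j] - 1 < n_rows and j - 1 < n_cols
             and weights[p[j] - 1][j - 1] > 0]
    return sorted(pairs)


# Toy data, invented for illustration only.
lexicon = {
    "the": {"le", "la"}, "cat": {"chat"}, "sleeps": {"dort"},
    "like": {"aime", "j'aime"}, "green": {"vert"}, "tea": {"thé"},
    "welcome": {"bienvenue"}, "to": {"sur"}, "our": {"notre"}, "site": {"site"},
}
source = ["the cat sleeps", "I like green tea", "welcome to our site"]
target = ["bienvenue sur notre site", "le chat dort", "j'aime le thé vert"]

weights = [[translation_overlap(s, t, lexicon) for t in target] for s in source]
alignment = max_weight_matching(weights)
print(alignment)  # [(0, 1), (1, 2), (2, 0)]
```

The point of a global matching over a greedy one shows up on a matrix such as `[[5, 4], [4, 0]]`: taking the single best edge first yields a total weight of 5, whereas the matching above pairs row 0 with column 1 and row 1 with column 0 for a total of 8.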