IEEE International Conference on Application of Information and Communication Technologies

Identification of the parallel documents from multilingual news websites

Abstract

We present the initial results of our experiments on document alignment for the online news domain. Specifically, as opposed to cross-site comparable news alignment, we focus on identifying parallel documents within the same multilingual websites. In such a setting, parallel news stories often turn out to be direct translations of each other, and they tend to share common media and to be published close together in time. We leverage this domain-specific property of the data and propose a straightforward yet competitive heuristic that performs on par with a machine learning-based method in terms of precision and outperforms a widely used bitext extraction system on a range of metrics. Moreover, this heuristic has allowed us to identify comparable documents overlooked by a human annotator. Although both the rule-based and learning-based methods that we present are language independent, we focus on the Russian-Kazakh language pair, as the present study is one of the initial steps towards the greater objective of building a corresponding parallel corpus and a machine translation system.