Venue: Conference on Empirical Methods in Natural Language Processing

An Iterative Link-based Method for Parallel Web Page Mining

Abstract

Identifying parallel web pages in bilingual web sites is a crucial step in constructing bilingual resources for cross-lingual information processing. In this paper, we propose a link-based approach to identifying parallel web pages within bilingual web sites. Existing methods employ only internal translation similarity (such as content-based similarity and page structural similarity); we hypothesize that external translation similarity is also an effective feature for identifying parallel web pages. Within a bilingual web site, web pages are interconnected by hyperlinks. The basic idea of our method is that the translation similarity of two pages can be inferred from their neighbor pages, which serve as an important source of external similarity. The translation similarities of page pairs therefore influence one another. We develop an iterative algorithm that estimates the external translation similarity and the final translation similarity, combining the internal and external similarity measures. Experiments on six bilingual web sites demonstrate that our method is effective and obtains a significant improvement (6.2% F-Score) over a baseline that uses only internal translation similarity.
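The abstract does not give the update rule, so the following is only an illustrative sketch of how internal and external similarity might be combined iteratively. Everything in it is an assumption made for illustration, not the paper's formulation: the function name `iterate_similarity`, the mixing weight `alpha`, the fixed iteration count, the "best-matching neighbor, then average" definition of external similarity, and the fallback to the internal score when a page has no linked neighbors.

```python
def iterate_similarity(internal, links_src, links_tgt, alpha=0.5, iterations=10):
    """Hypothetical sketch of link-based similarity propagation.

    internal:  dict mapping (source_page, target_page) -> internal similarity in [0, 1]
    links_src: dict mapping a source-language page -> list of pages it links to
    links_tgt: dict mapping a target-language page -> list of pages it links to
    alpha:     assumed weight of the internal similarity in the combination
    """
    sim = dict(internal)  # start from the internal similarity
    for _ in range(iterations):
        updated = {}
        for (a, b), base in internal.items():
            neighbors_a = links_src.get(a, [])
            neighbors_b = links_tgt.get(b, [])
            if neighbors_a and neighbors_b:
                # Assumed external similarity: for each neighbor of a, take its
                # best-matching neighbor of b, then average over a's neighbors.
                best = [
                    max(sim.get((x, y), 0.0) for y in neighbors_b)
                    for x in neighbors_a
                ]
                external = sum(best) / len(best)
            else:
                # Assumed fallback: no link evidence, so reuse the internal score.
                external = base
            updated[(a, b)] = alpha * base + (1 - alpha) * external
        sim = updated
    return sim


# Toy data (invented): e1 links to e2, z1 links to z2, z3 has no links.
internal = {("e1", "z1"): 0.5, ("e1", "z3"): 0.5, ("e2", "z2"): 0.9}
links_src = {"e1": ["e2"]}
links_tgt = {"z1": ["z2"]}
scores = iterate_similarity(internal, links_src, links_tgt)
```

In this toy case the pairs (e1, z1) and (e1, z3) are tied on internal similarity, but only (e1, z1) has neighbors that form a likely parallel pair (e2, z2), so its combined score rises above that of (e1, z3). This mirrors the abstract's idea that the similarities of page pairs influence one another through hyperlinks.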
