Journal of Information Science

Double-pass clustering technique for multilingual document collections

Abstract

Multilingual document sets, which contain documents written in a variety of languages, often need to be categorized automatically into topically homogeneous subsets, for example when applying an automatic summarization system to multilingual news articles. However, there have been few studies on multilingual document clustering to date. In particular, it is not known whether clustering techniques are effective on medium- or large-scale multilingual document sets. For scalability, techniques should be based on dictionary-based translation and a single- or double-pass clustering algorithm. This article reports on experiments applying multilingual document clustering to medium-scale sets of English, French, German and Italian documents (Reuters news articles). The results show that the double-pass algorithm has a positive effect when each document is translated. On the other hand, the cluster translation strategy, in which the clusters obtained by applying a clustering algorithm to each language's document set are translated, has almost no effect. Also, translation disambiguation techniques improve the effectiveness of clustering, but only slightly.
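
The abstract names the building blocks (dictionary-based translation, single- and double-pass clustering) without giving their details. The following is a minimal Python sketch of the document translation strategy under stated assumptions, not the article's implementation: documents are bags of words, every dictionary translation of a term is kept (no disambiguation), similarity is cosine, and the second pass simply reassigns each document to the nearest centroid produced by the first pass. All function names, the toy dictionary, and the threshold value are hypothetical.

```python
import math
from collections import Counter


def translate(tokens, dictionary):
    """Dictionary-based translation: replace each token with all of its
    listed translations (no disambiguation); unknown tokens are kept."""
    out = []
    for tok in tokens:
        out.extend(dictionary.get(tok, [tok]))
    return out


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def single_pass(vectors, threshold):
    """One sequential pass: a document joins the most similar existing
    cluster if the similarity reaches the threshold, else starts a new one."""
    centroids, labels = [], []
    for vec in vectors:
        best, best_sim = -1, threshold
        for i, centroid in enumerate(centroids):
            sim = cosine(vec, centroid)
            if sim >= best_sim:
                best, best_sim = i, sim
        if best < 0:
            centroids.append(Counter(vec))  # new cluster seeded by this document
            best = len(centroids) - 1
        else:
            centroids[best].update(vec)  # summed vector; cosine ignores scale
        labels.append(best)
    return labels, centroids


def double_pass(vectors, threshold):
    """Assumed second pass: keep the first-pass centroids fixed and
    reassign every document to its most similar centroid."""
    _, centroids = single_pass(vectors, threshold)
    return [
        max(range(len(centroids)), key=lambda i: cosine(vec, centroids[i]))
        for vec in vectors
    ]


# Toy usage: two English documents and two French documents translated to English.
fr_en = {"prix": ["price"], "pétrole": ["oil", "petroleum"],
         "match": ["match", "game"], "football": ["football", "soccer"]}
docs = [
    ["oil", "price", "rise"],
    ["football", "match", "win"],
    translate(["prix", "pétrole"], fr_en),
    translate(["match", "football"], fr_en),
]
vectors = [Counter(d) for d in docs]
print(double_pass(vectors, threshold=0.3))
```

The second pass is meant to reduce the order dependence of a single sequential pass, since early documents are compared only against the few centroids that exist at that point. In the cluster translation strategy described in the abstract, clustering would instead be run per language first, and the resulting clusters translated afterwards.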
