Conference: International Conference of the Cross-Language Evaluation Forum

A Language-Independent Approach to Identify the Named Entities in Under-Resourced Languages and Clustering Multilingual Documents

Abstract

This paper presents a language-independent Multilingual Document Clustering (MDC) approach on comparable corpora. Named entities (NEs) such as persons, locations, and organizations play a major role in measuring document similarity. We propose a method to identify the NEs present in under-resourced Indian languages (Hindi and Marathi) using the NEs present in English, which is a high-resourced language. The identified NEs are then used to form multilingual document clusters with the Bisecting k-means clustering algorithm. We do not use any non-English linguistic tools or resources such as WordNet, a Part-Of-Speech tagger, or bilingual dictionaries, which makes the proposed approach completely language-independent. Experiments are conducted on a standard dataset provided by FIRE for its 2010 Ad-hoc Cross-Lingual document retrieval task on Indian languages. We consider the English, Hindi and Marathi news datasets for our experiments. The system is evaluated using the F-score, Purity and Normalized Mutual Information (NMI) measures, and the results obtained are encouraging.
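Two of the evaluation measures named in the abstract, Purity and NMI, can be illustrated with a minimal sketch. This is not the authors' evaluation code. The function names `purity`, `entropy` and `nmi` are hypothetical, and the NMI normalization used here (the arithmetic mean of the two entropies) is an assumption, since the abstract does not say which variant the paper uses.

```python
import math
from collections import Counter


def purity(clusters, classes):
    """Fraction of documents that belong to the majority class of their cluster."""
    by_cluster = {}
    for k, c in zip(clusters, classes):
        by_cluster.setdefault(k, Counter())[c] += 1
    majority_total = sum(max(cnt.values()) for cnt in by_cluster.values())
    return majority_total / len(classes)


def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum((v / n) * math.log(v / n) for v in Counter(labels).values())


def nmi(clusters, classes):
    """Mutual information divided by the mean of the two entropies.

    Assumption: arithmetic-mean normalization; other variants exist.
    """
    n = len(classes)
    joint = Counter(zip(clusters, classes))
    cluster_sizes = Counter(clusters)
    class_sizes = Counter(classes)
    mi = sum(
        (v / n) * math.log(n * v / (cluster_sizes[k] * class_sizes[c]))
        for (k, c), v in joint.items()
    )
    denom = (entropy(clusters) + entropy(classes)) / 2
    # Convention chosen here: both labelings trivial (one group each) scores 1.0.
    return mi / denom if denom > 0 else 1.0


if __name__ == "__main__":
    # Hypothetical toy data: cluster ids assigned by a system vs. gold topic labels.
    system = [0, 0, 0, 1, 1, 1]
    gold = ["sports", "sports", "politics", "politics", "politics", "politics"]
    print("purity:", purity(system, gold))
    print("nmi:", nmi(system, gold))
```

Both functions take one cluster id and one gold label per document, in the same order. Purity rewards clusters dominated by a single class but can be inflated by producing many small clusters, which is one reason it is commonly reported alongside NMI.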
