Conference: Pacific-Asia Conference on Knowledge Discovery and Data Mining

The Role of Hubs in Cross-Lingual Supervised Document Retrieval


Abstract

Information retrieval in multi-lingual document repositories is of high importance in modern text mining applications. Analyzing textual data is, however, not without difficulties. Regardless of the choice of feature representation, textual data is inherently high-dimensional, and all inference is bound to be affected to some degree by the well-known curse of dimensionality. In this paper, we focus on one particular aspect of the dimensionality curse, known as hubness. Hubs emerge as influential points in the k-nearest neighbor (kNN) topology of the data. They have been shown to severely degrade similarity-based methods in high-dimensional data, interfering with both retrieval and classification. The issue of hubness in textual data has already been briefly addressed, but not in the context presented here, namely the multi-lingual retrieval setting. Our goal was to gain insight into the cross-lingual hub structure and to exploit it for improving retrieval and classification performance. Our initial analysis allowed us to devise a hubness-aware instance weighting scheme for the canonical correlation analysis procedure, which is used to construct the common semantic space that enables cross-lingual document retrieval and classification. The experimental evaluation indicates that the proposed approach outperforms the baseline. This shows that hubs can indeed be exploited to improve the robustness of textual feature representations.
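
To make the notion of hubness concrete, the following is a minimal sketch of how hubs in a kNN topology are commonly measured: count how often each point appears in the k-nearest-neighbor lists of the other points, then look at the skewness of those counts. The function names `k_occurrences` and `skewness`, the use of cosine similarity, and the synthetic Gaussian data are all assumptions made for illustration; the abstract does not specify the paper's actual measure or its weighting scheme.

```python
import numpy as np

def k_occurrences(X, k):
    """Count how often each row of X appears among the k nearest
    neighbors of the other rows (hypothetical helper)."""
    # Cosine similarity is assumed here; it is a common choice for text vectors.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.where(norms == 0, 1.0, norms)
    sim = Xn @ Xn.T
    np.fill_diagonal(sim, -np.inf)  # a point is never its own neighbor
    # Indices of the k most similar points for every row (order within k is arbitrary).
    knn = np.argpartition(-sim, k - 1, axis=1)[:, :k]
    return np.bincount(knn.ravel(), minlength=X.shape[0])

def skewness(counts):
    """Third standardized moment of the k-occurrence distribution."""
    c = counts.astype(float)
    std = c.std()
    return 0.0 if std == 0 else float(((c - c.mean()) ** 3).mean() / std ** 3)

# Synthetic high-dimensional data standing in for document vectors.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 100))
counts = k_occurrences(X, k=5)
print("max k-occurrence:", counts.max(), "skewness:", skewness(counts))
```

Points with unusually large counts are the hubs; a strongly right-skewed count distribution is the usual sign that a few points dominate the neighbor lists.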
