Optimization of Cross-Lingual LSI Training Data

Abstract

The technique of latent semantic indexing (LSI) is widely employed in applications to provide information retrieval, categorization, clustering, and discovery capabilities. In these applications, the key feature of the technique is its ability to compare objects (such as documents and queries) based on the semantics of their constituents. These comparisons are carried out in a high-dimensional vector space, which is generated from an analysis of the occurrences of features in the items of a training set. The LSI literature repeatedly notes that training items should be similar in content to the items the application will handle. This paper presents a principled approach for making such a selection. We present test results for the technique on cross-lingual document similarity comparison. The results demonstrate that, at least for this use case, applying the technique can have a dramatic beneficial effect on LSI performance.
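
To make the vector-space comparison described in the abstract concrete, the following is a minimal sketch of generic LSI, not the training-data selection method the paper proposes (the abstract does not describe it). It assumes a common cross-lingual setup in which each training item is a pair of parallel documents merged into one; the toy vocabulary, the training items, and the helper names (`build_term_doc_matrix`, `train_lsi`, `fold_in`, `cosine`) are all hypothetical and chosen only for illustration.

```python
import numpy as np

def build_term_doc_matrix(docs, vocab):
    """Count occurrences of each vocabulary term in each training item."""
    index = {term: i for i, term in enumerate(vocab)}
    A = np.zeros((len(vocab), len(docs)))
    for j, doc in enumerate(docs):
        for tok in doc.split():
            if tok in index:
                A[index[tok], j] += 1.0
    return A

def train_lsi(A, k):
    """Truncated SVD of the term-document matrix: keep the top-k dimensions."""
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k]

def fold_in(text, vocab, U_k, s_k):
    """Project a new document or query into the k-dimensional LSI space."""
    counts = build_term_doc_matrix([text], vocab)[:, 0]
    return (counts @ U_k) / s_k

def cosine(a, b):
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

# Hypothetical toy vocabulary: four English terms and their French counterparts.
vocab = ["cat", "dog", "stock", "market", "chat", "chien", "bourse", "marche"]

# Assumed training set: each item merges an English text with its French parallel.
train_docs = [
    "cat dog chat chien",
    "cat cat chat chat",
    "stock market bourse marche",
    "market market marche marche",
]

A = build_term_doc_matrix(train_docs, vocab)
U_k, s_k = train_lsi(A, k=2)

# Compare an English query with French queries that never co-occur with it
# in a monolingual document; they meet only through the merged training items.
q_en = fold_in("cat", vocab, U_k, s_k)
sim_related = cosine(q_en, fold_in("chat", vocab, U_k, s_k))
sim_unrelated = cosine(q_en, fold_in("bourse", vocab, U_k, s_k))

print(f"cat vs chat:   {sim_related:.3f}")
print(f"cat vs bourse: {sim_unrelated:.3f}")
```

In this toy setup the English and French terms for the same concept land in the same direction of the reduced space, while terms from the unrelated topic do not. Which items go into `train_docs` determines that space, which is why the choice of training data matters for the comparisons the abstract describes.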