International journal on digital libraries

Using wavelet analysis for text categorization in digital libraries: a first experiment with Strathprints


Abstract

Digital libraries increasingly benefit from research on automated text categorization for improved access. Such research is typically carried out by means of standard test collections. In this article, we present a pilot experiment that replaces such test collections with a set of 6,000 objects from a real-world digital repository, indexed by Library of Congress Subject Headings, and we test support vector machines in a supervised learning setting for their ability to reproduce the existing classification. To augment the standard approach, we introduce a combination of two novel elements: using functions for document content representation in Hilbert space, and adding extra semantics from lexical resources to the representation. Results suggest that wavelet-based kernels slightly outperformed traditional kernels when reconstructing the classification from abstracts, whereas the reverse held for full-text documents, the latter outcome being due to word sense ambiguity. The practical implementation of our methodological framework enhances the analysis and representation of specific knowledge relevant to large-scale digital collections, in this case the thematic coverage of the collections. Representation of specific knowledge about digital collections is one of the basic elements of persistent archives, and it is less studied than the representation of digital objects and collections. Our research is an initial step in this direction: it further develops the methodological approach and demonstrates that text categorization can be applied to analyse the thematic coverage in digital repositories.
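
The contrast between wavelet-based and traditional kernels can be illustrated with a minimal sketch. The abstract does not give the kernel used in the article, so everything below is an assumption made for illustration: the Morlet-type mother wavelet h(u) = cos(1.75u) exp(-u^2/2), the dilation parameter `a`, the toy term-weight vectors, and all function names are hypothetical and are not taken from the article.

```python
import math


def mother_wavelet(u):
    # Assumed Morlet-type mother wavelet: h(u) = cos(1.75 u) * exp(-u^2 / 2).
    return math.cos(1.75 * u) * math.exp(-u * u / 2.0)


def wavelet_kernel(x, y, a=1.0):
    # Translation-invariant wavelet kernel (illustrative form):
    # K(x, y) = product over dimensions i of h((x_i - y_i) / a).
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    k = 1.0
    for xi, yi in zip(x, y):
        k *= mother_wavelet((xi - yi) / a)
    return k


def linear_kernel(x, y):
    # A "traditional" kernel for comparison: the plain dot product.
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(xi * yi for xi, yi in zip(x, y))


def gram_matrix(kernel, docs):
    # Pairwise kernel values between all documents.
    return [[kernel(p, q) for q in docs] for p in docs]


if __name__ == "__main__":
    # Hypothetical term-weight vectors for three documents.
    docs = [[0.2, 0.0, 0.7], [0.1, 0.3, 0.6], [0.9, 0.8, 0.0]]
    for name, kernel in (("wavelet", wavelet_kernel), ("linear", linear_kernel)):
        print(name)
        for row in gram_matrix(kernel, docs):
            print("  ", [round(v, 4) for v in row])
```

A Gram matrix of this kind could then be passed to any support vector machine implementation that accepts precomputed kernels, which is one way to compare kernels on the same labelled documents.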
