Conference: ICONIP 2008; International Conference on Advances in Neuro-Information Processing

On Efficient Content Based Information Retrieval Using SVM and Higher Order Correlation Analysis

Abstract

Efficient retrieval of information with regard to its meaning and content is an important problem in data mining systems for the creation, management and querying of the very large information databases that exist on the World Wide Web. In this paper we deal with the main aspect of content-based retrieval, namely document classification, and outline a novel, improved and systematic approach to its solution. We present a document classification system for non-domain-specific content that relies mainly on the learning and generalization capabilities of SVM neural networks. The main contribution of this paper lies in the feature extraction methodology, which, first, uses word semantic categories rather than raw words, as rival approaches do. To cope with the problem of dimensionality reduction, the proposed approach introduces a novel higher-order feature extraction scheme for document categorization, based on two- and three-dimensional higher-order correlation analysis of word semantic categories, derived from co-occurrence analysis. The suggested methodology compares favourably with widely accepted techniques based on raw word frequency on a collection of documents concerning the Dewey Decimal Classification (DDC) system. These comparisons involve different Multilayer Perceptron (MLP) algorithms as well as the Support Vector Machine (SVM), LVQ and the conventional k-NN technique. SVM models seem to outperform all other rival methods in this study.
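The abstract does not say how the co-occurrence statistics are computed, so the following is only a minimal sketch of the general idea: map words to semantic categories, then count how often pairs (second order) and triples (third order) of categories appear together. The lexicon `CATEGORY_OF`, the sliding window of three categories, and all function names are assumptions made for illustration, not the authors' method.

```python
from collections import Counter
from itertools import combinations

# Hypothetical word -> semantic category lexicon (an assumption for illustration).
CATEGORY_OF = {
    "neuron": "biology",
    "cell": "biology",
    "matrix": "mathematics",
    "vector": "mathematics",
    "market": "economics",
    "price": "economics",
}


def to_categories(tokens):
    """Map raw words to semantic categories, dropping words not in the lexicon."""
    return [CATEGORY_OF[t] for t in tokens if t in CATEGORY_OF]


def cooccurrence_counts(categories, order, window=3):
    """Count unordered groups of `order` distinct categories per sliding window.

    The window size is an assumed parameter; the abstract does not specify one.
    """
    counts = Counter()
    for start in range(max(1, len(categories) - window + 1)):
        present = sorted(set(categories[start:start + window]))
        for combo in combinations(present, order):
            counts[combo] += 1
    return counts


def feature_vector(tokens, all_categories, window=3):
    """Concatenate second-order (pair) and third-order (triple) counts."""
    cats = to_categories(tokens)
    pairs = cooccurrence_counts(cats, 2, window)
    triples = cooccurrence_counts(cats, 3, window)
    names = sorted(all_categories)
    vec = [pairs[c] for c in combinations(names, 2)]
    vec += [triples[c] for c in combinations(names, 3)]
    return vec
```

With k categories the vector has C(k, 2) + C(k, 3) entries, which depends on the number of categories rather than on vocabulary size; that is one way to read the dimensionality reduction the abstract mentions. Vectors of this kind could then be passed to any classifier, such as an SVM.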
