
Information retrieval: A framework for recommending text-based classification algorithms.


Abstract

Classification is one of the central issues in information retrieval systems dealing with text data. The need for effective approaches has increased dramatically with the advent of the World Wide Web and massive digital libraries. Effective methods are invaluable for exploring information repositories in order to discover similarities between groups of text-based documents.

One goal of this thesis is the development of tools for supporting users of machine learning and data mining algorithms in the area of text classification. While interest in such technology is growing rapidly, the tools available to end users who are not experts are still limited, because machine learning systems are difficult to design and their number keeps increasing. As a result, system designers face two major research problems, algorithmic model selection and model combination: (a) selecting the most suitable model or algorithm for a given application, and (b) integrating it with useful and effective transformations of the data. Traditionally, these problems are resolved by trial and error or by consulting experts. The first solution is time-consuming and unreliable. The second is expensive and biased by the expert's own prejudices and preferences. This thesis develops a meta-model framework called the Regression Model Framework (RMF) that supports system designers with model selection and method combination. RMF uses statistical regression analysis to combine prior meta-knowledge with meta-level learning.

The second major goal of this thesis is to investigate how text classification is performed on the Web. A great many text-based documents are available on the Internet and in corporate intranets, and categorizing them into useful semantic categories is a rewarding and challenging research problem. However, current approaches to text categorization on the Web mostly concentrate on simple representation schemes based on word occurrence and word frequency. The structural information inherent in Web documents is usually neglected. In analyzing Web documents, this thesis investigates the relative importance of hypertext tags in predicting the relevance of unknown documents.
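
As an illustration of how structural information could enter a document representation, the following is a minimal sketch of tag-weighted term counting. It is not the method of the thesis: the function name `tag_weighted_terms`, the class `TagWeightedCounter`, and the weight values in `TAG_WEIGHTS` are all hypothetical, chosen only to show the idea that a word inside a title or heading might count for more than the same word in body text.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical weights for illustration only; the abstract gives no values.
TAG_WEIGHTS = {"title": 3.0, "h1": 2.0, "a": 1.5}
DEFAULT_WEIGHT = 1.0


class TagWeightedCounter(HTMLParser):
    """Accumulate term weights, scaling each word by its strongest enclosing tag."""

    def __init__(self):
        super().__init__()
        self.open_tags = []       # stack of currently open tags
        self.weights = Counter()  # term -> accumulated weight

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag, tolerating sloppy HTML.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        # Use the largest weight among all enclosing tags.
        weight = max(
            (TAG_WEIGHTS.get(t, DEFAULT_WEIGHT) for t in self.open_tags),
            default=DEFAULT_WEIGHT,
        )
        for term in re.findall(r"[a-z0-9]+", data.lower()):
            self.weights[term] += weight


def tag_weighted_terms(html):
    """Return a dict mapping each term to its tag-weighted frequency."""
    parser = TagWeightedCounter()
    parser.feed(html)
    parser.close()
    return dict(parser.weights)
```

With all weights set to 1.0, this reduces to the plain word-frequency representation that the abstract describes as the common approach. Comparing classifier performance under different weight settings is one way the relative importance of individual tags might be estimated.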
