Application of genetic programming to text categorization.

Abstract

This dissertation applies genetic programming to text categorization problems. Genetic programming algorithms are applied to a set of news articles to evolve programs that determine whether an article belongs to a particular category. The programs are randomly generated from a set of initial functions and constants. Programs with the fewest false assignments are favored in the selection for recombination in subsequent iterations of the genetic programming algorithm.

The form of the solution is not determined a priori, as it is in other text categorization methods. The basis set of functions and constants used by the genetic analysis program is specified in advance and may include the three basic logical functions and a set of vocabulary words. Other sets of basis functions can be supplied to the genetic algorithm to obtain different programs. The form in which these functions and constants are combined is determined randomly by the genetic algorithm.

The results indicate that, in the cases examined, genetic programming methods are as good as or slightly better than other decision tree or rule induction methods described by Apté et al. [Apté 1994]. The Genetic Programming methods used a simpler set of features and functions: no word stemming, no explicit stop word removal, a local dictionary, and Boolean functions. The F1-measure of categorization performance of 80.4% achieved by Genetic Programming compares favorably with the 78.5% breakeven performance of traditional Boolean rule induction methods. It is comparable with the 80.5% breakeven performance of rule induction methods that use a more complex feature set, such as word frequency [Apté 1994].

Characteristics of Genetic Programming text categorization were studied to understand the sensitivity of Genetic Programming methods to vocabulary size, population size, and training and testing set selection methods. Temporal characteristics of the Reuters article corpus [Lewis-21578] were also studied. The results are of interest to both Genetic Programming and traditional categorization methods and may point to significant future performance improvements in both domains. In some cases these results were better than Apté's.
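
The approach described in the abstract can be illustrated with a minimal Python sketch. It is not the dissertation's implementation. It assumes that the three basic logical functions are AND, OR, and NOT, that the constants are vocabulary words tested for presence in an article, and that selection is simple truncation with subtree crossover and no mutation. The toy corpus, the `MAX_NODES` size cap, and all function names are hypothetical and chosen only for illustration.

```python
import random

# Hypothetical toy corpus: (words in the article, True if it is in the category).
DOCS = [
    ({"wheat", "harvest", "export"}, True),
    ({"grain", "wheat", "tonnes"}, True),
    ({"grain", "corn", "export"}, True),
    ({"oil", "barrel", "export"}, False),
    ({"bank", "rate", "profit"}, False),
    ({"oil", "profit", "tonnes"}, False),
]
VOCAB = sorted(set().union(*(words for words, _ in DOCS)))
MAX_NODES = 25  # assumed cap that keeps evolved programs small


def random_tree(rng, depth):
    """Build a random Boolean program: a word, or a tuple (op, child, ...)."""
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(VOCAB)
    op = rng.choice(["and", "or", "not"])
    if op == "not":
        return ("not", random_tree(rng, depth - 1))
    return (op, random_tree(rng, depth - 1), random_tree(rng, depth - 1))


def evaluate(tree, words):
    """Return True if the program assigns the article to the category."""
    if isinstance(tree, str):
        return tree in words
    if tree[0] == "not":
        return not evaluate(tree[1], words)
    left, right = evaluate(tree[1], words), evaluate(tree[2], words)
    return (left and right) if tree[0] == "and" else (left or right)


def false_assignments(tree, docs):
    """Fitness to minimize: number of misclassified articles."""
    return sum(evaluate(tree, words) != label for words, label in docs)


def f1_score(tree, docs):
    """F1 = 2*TP / (2*TP + FP + FN), defined as 0.0 when there are no true positives."""
    tp = sum(evaluate(tree, w) and y for w, y in docs)
    fp = sum(evaluate(tree, w) and not y for w, y in docs)
    fn = sum((not evaluate(tree, w)) and y for w, y in docs)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def paths(tree, path=()):
    """Yield the index path of every node in the tree."""
    yield path
    if isinstance(tree, tuple):
        for i, child in enumerate(tree[1:], start=1):
            yield from paths(child, path + (i,))


def subtree_at(tree, path):
    for i in path:
        tree = tree[i]
    return tree


def replace_at(tree, path, new):
    """Return a copy of tree with the node at path replaced by new."""
    if not path:
        return new
    i = path[0]
    return tree[:i] + (replace_at(tree[i], path[1:], new),) + tree[i + 1:]


def crossover(rng, a, b):
    """Recombination: graft a random subtree of b onto a random node of a."""
    target = rng.choice(list(paths(a)))
    donor = subtree_at(b, rng.choice(list(paths(b))))
    return replace_at(a, target, donor)


def evolve(docs, pop_size=60, generations=30, seed=0):
    rng = random.Random(seed)
    population = [random_tree(rng, 3) for _ in range(pop_size)]
    for _ in range(generations):
        # Programs with the fewest false assignments are kept as parents.
        population.sort(key=lambda t: false_assignments(t, docs))
        parents = population[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            child = crossover(rng, rng.choice(parents), rng.choice(parents))
            too_big = sum(1 for _ in paths(child)) > MAX_NODES
            children.append(random_tree(rng, 3) if too_big else child)
        population = parents + children
    return min(population, key=lambda t: false_assignments(t, docs))


if __name__ == "__main__":
    best = evolve(DOCS)
    print(best, false_assignments(best, DOCS), f1_score(best, DOCS))
```

In this sketch the shape of the classifier is left open, as in the abstract: only the function set and the vocabulary are fixed in advance, and the way they are combined emerges from random generation and recombination. A real experiment would evaluate the evolved program on held-out test articles instead of the training set.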
