Towards accurate and efficient classification: A discriminative and frequent pattern-based approach.

Abstract

Classification is a core method widely studied in machine learning, statistics, and data mining. Many classification methods have been proposed in the literature, such as Support Vector Machines, Decision Trees, and Bayesian Networks, most of which assume that the input data is in a feature vector representation. However, in some classification problems, the predefined feature space is not discriminative enough to distinguish between different classes. More seriously, in many other applications, the input data has very complex structure and no initial feature vector representation, such as transaction data (e.g., customer shopping transactions), sequences (e.g., protein sequences and software execution traces), graphs (e.g., chemical compounds and molecules, social and biological networks), semi-structured data (e.g., XML documents), and text data. For both scenarios, a primary question is how to construct a discriminative and compact feature set on which classification can be performed with good accuracy. Although many kernel-based approaches have been proposed to transform the feature space and to measure the similarity between two data objects, the implicit definition of the feature space makes kernel-based approaches hard to interpret, and their high computational complexity makes them hard to scale to large problems. A concrete example of complex structural data classification is classifying chemical compounds into various classes (e.g., toxic vs. nontoxic, active vs. inactive), where a key challenge is how to construct discriminative graph features. While simple features such as atoms and links are too simple to preserve the structural information, graph kernel methods make the classifiers hard to interpret.

In this dissertation, I proposed to use frequent patterns as higher-order and discriminative features to characterize data, especially complex structural data, and thus enhance the classification power. Towards this goal, I designed a framework of discriminative frequent pattern-based classification which has been shown to improve the classification performance significantly. Theoretical analysis is provided to reveal the association between a feature's frequency and its discriminative power, thereby demonstrating that frequent patterns are good candidates for discriminative features.

Due to the explosive nature of frequent pattern mining, frequent pattern-based feature construction can become a computational bottleneck if the whole set of frequent patterns with respect to a minimum support threshold is generated. To overcome this bottleneck, I proposed two solutions, DDPMine and LEAP, which directly mine the most discriminative features without generating the complete set. Both methods have been shown to improve efficiency while maintaining classification accuracy.

I further applied discriminative frequent pattern-based classification to classifying chemical compounds with a very skewed class distribution, which poses challenges for both feature construction and model learning. An ensemble framework, which includes ensembles in both the data space and the feature space, is proposed to handle these challenges and is shown to achieve good classification performance.

In conclusion, the framework of discriminative frequent pattern-based classification could lead to a highly accurate, efficient, and interpretable classifier on complex data. The pattern-based classification technique could have great impact in a wide range of applications, including text categorization, chemical compound classification, and software behavior analysis.
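The general idea of using frequent patterns as higher-order features can be illustrated with a minimal sketch. It is not the dissertation's implementation: it enumerates all frequent itemsets by brute force and then ranks them, which is exactly the two-step process that DDPMine and LEAP are said to avoid. Information gain as the discriminative measure, the toy transactions, and all function names are assumptions made for illustration.

```python
from itertools import combinations
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    result = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        result -= p * log2(p)
    return result


def frequent_itemsets(transactions, min_support, max_size=2):
    """Brute-force enumeration of itemsets (up to max_size items)
    whose relative support is at least min_support."""
    items = sorted({i for t in transactions for i in t})
    n = len(transactions)
    found = {}
    for k in range(1, max_size + 1):
        for cand in combinations(items, k):
            s = frozenset(cand)
            support = sum(1 for t in transactions if s <= t) / n
            if support >= min_support:
                found[s] = support
    return found


def information_gain(pattern, transactions, labels):
    """Information gain of splitting the data on 'pattern occurs or not'."""
    covered = [y for t, y in zip(transactions, labels) if pattern <= t]
    rest = [y for t, y in zip(transactions, labels) if not pattern <= t]
    n = len(labels)
    return (entropy(labels)
            - len(covered) / n * entropy(covered)
            - len(rest) / n * entropy(rest))


def top_discriminative_patterns(transactions, labels, min_support=0.3, k=3):
    """Rank frequent itemsets by information gain and keep the best k."""
    patterns = frequent_itemsets(transactions, min_support)
    ranked = sorted(
        patterns,
        key=lambda p: (-information_gain(p, transactions, labels), sorted(p)),
    )
    return ranked[:k]


def to_feature_vector(transaction, patterns):
    """Binary feature vector: 1 if the pattern occurs in the transaction."""
    return [1 if p <= transaction else 0 for p in patterns]


# Hypothetical toy data: the pair {a, b} separates the classes,
# although neither 'a' nor 'b' does so on its own.
transactions = [frozenset(t) for t in ("ab", "abc", "abd", "ac", "bd", "cd")]
labels = [1, 1, 1, 0, 0, 0]
top = top_discriminative_patterns(transactions, labels)
features = [to_feature_vector(t, top) for t in transactions]
```

In this toy data the two-item pattern ranks above every single item, which is the sense in which a frequent pattern acts as a higher-order feature; the resulting binary vectors could then be passed to any standard classifier.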

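Bibliographic record

  • Author: Cheng, Hong
  • Author affiliation: University of Illinois at Urbana-Champaign
  • Degree-granting institution: University of Illinois at Urbana-Champaign
  • Subject: Computer Science
  • Degree: Ph.D.
  • Year: 2008
  • Pages: 101 p.
  • Total pages: 101
  • Original format: PDF
  • Language of text: English
  • Chinese Library Classification: Automation technology, computer technology
  • Date indexed: 2022-08-17 11:38:40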
