
An evolutionary machine learning framework for big data sequence mining.

Abstract

Sequence classification is an important problem in many real-world applications. Unlike other machine learning data, sequence data contains no "explicit" features or signals that can help traditional machine learning algorithms learn and predict from the data. Sequence data exhibits inter-relationships among its elements that are important in understanding and predicting future sequences. However, finding these relationships has been proven to be an NP-hard problem. Naive enumeration of combinations of elements, or "brute force" iterative approaches to defining these features, often results in poor predictions. Some algorithms that perform well in prediction lack transparency, i.e., the discriminating features generated by these methods are not easily identifiable. In addition, the size of sequence-based datasets presents practical challenges to most learning algorithms. Most sequence-based datasets contain millions or even billions of instances, for example, the genome-wide sequences of organisms in bioinformatics. At these sizes, classic learning algorithms often become prohibitively expensive, making scalability an important issue. Therefore, there is a need for an approach that can help find features/signals in complex sequences, offer meaningful discriminators, produce good predictions, and scale well in time and space. This dissertation addresses the above issues by designing a comprehensive approach in the form of the Evolutionary Machine Learner (EML) framework. This framework can be employed on sequence-based datasets to generate explicit, human-recognizable features while solving the scalability issue. The EML framework includes a novel EA-based feature generation (EFG) algorithm for automatic feature construction.
The power and usefulness of the EFG algorithm are demonstrated by modeling four complex sequencing problems in bioinformatics and generating meaningful, human-understandable features with accuracy comparable to or better than that of state-of-the-art algorithms. The EFG algorithm is also validated on time series classification problems, showing the generic nature of the algorithm in finding the important discriminating patterns that assist in modeling sequence-based data. The EML framework addresses the scalability issue by means of a novel parallel scalable machine learning algorithm (PSBML) based on spatially structured evolutionary algorithms. PSBML is validated on real-world "big data" classification problems for various properties of meta-learning, scalability, and noise resilience using well-known benchmark datasets. The PSBML algorithm is also proven theoretically to be a large-margin classifier with linear scalability in training time and space, giving it a unique distinction among existing large-scale learning algorithms. Finally, the EML framework is validated on a large genome-wide bioinformatics classification problem and a large time series problem, showing that the combined algorithms achieve higher predictive performance, training-time speedup, and the ability to produce human-understandable discriminating signals as features.
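The abstract does not describe the internals of EFG, so the following is only a generic illustration of the idea of evolving human-readable sequence features with an evolutionary algorithm. It is not the dissertation's algorithm. The function names (`fitness`, `mutate`, `crossover`, `evolve_motifs`), the DNA alphabet, the fixed motif length, and the fitness definition (difference in how often a motif occurs in each class) are all assumptions made for this sketch.

```python
import random

ALPHABET = "ACGT"  # assumed alphabet for this illustration


def fitness(motif, positives, negatives):
    """Absolute gap between the fractions of positive and negative
    sequences that contain the motif (an assumed, simple criterion)."""
    pos = sum(motif in s for s in positives) / len(positives)
    neg = sum(motif in s for s in negatives) / len(negatives)
    return abs(pos - neg)


def mutate(motif, rng):
    """Replace one randomly chosen position with a random symbol."""
    i = rng.randrange(len(motif))
    return motif[:i] + rng.choice(ALPHABET) + motif[i + 1:]


def crossover(a, b, rng):
    """One-point crossover of two equal-length motifs (length >= 2)."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def evolve_motifs(positives, negatives, k=4, pop_size=20,
                  generations=30, seed=0):
    """Evolve a population of length-k motifs; returns it sorted by fitness."""
    rng = random.Random(seed)  # seeded so runs are reproducible
    population = ["".join(rng.choice(ALPHABET) for _ in range(k))
                  for _ in range(pop_size)]

    def score(m):
        return fitness(m, positives, negatives)

    for _ in range(generations):
        # Keep the better half unchanged (elitism), refill with offspring.
        population.sort(key=score, reverse=True)
        elite = population[: pop_size // 2]
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            children.append(mutate(crossover(a, b, rng), rng))
        population = elite + children

    population.sort(key=score, reverse=True)
    return population
```

Each evolved feature here is a plain substring, so a reader can inspect it directly, which is the kind of transparency the abstract argues for. A real system would need richer feature representations and a held-out evaluation.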

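Bibliographic details

  • Author

    Kamath, Uday Krishna.;

  • Author affiliation

    George Mason University.;

  • Degree-granting institution George Mason University.;
  • Subject Computer Science.;Information Science.;Biology Bioinformatics.
  • Degree Ph.D.
  • Year 2014
  • Pagination 177 p.
  • Total pages 177
  • Original format PDF
  • Language of text eng
  • Chinese Library Classification
  • Keywords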
