
Scalable machine learning using applications in bioinformatics and cybercrime.



Abstract

This thesis contributes multiple scalable machine learning applications in the fields of bioinformatics and cybercrime. A highly parallel framework for machine learning, called the Collaborative Analytics Framework, is also presented. The framework leverages shared memory to efficiently process large datasets. Applications in bioinformatics gene sequence classification are implemented. In the gene sequence classification problem, unlabeled gene sequences are matched to sequences labeled with known taxonomies. Existing alignment-based methods are inefficient in practice and must balance performance by using shorter word lengths. Prior alignment-free methods do not scale efficiently as the number of trained sequences grows. A new alignment-free method, called Strand, is introduced. Strand achieves accuracy as good as or better than that of existing alignment-free methods, at improved speed and with a smaller in-memory training database footprint. Strand achieves this by exploiting a form of lossy compression called minhashing as part of an in-memory MapReduce-style framework. Strand is also applied to shotgun classification challenges for the purpose of abundance estimation. Scalable machine learning applications are then applied to multiple cybercrime datasets. First, a method is presented to cluster criminal websites that are loose copies of one another. This general method is then applied to two specific cases, detecting thousands of copied Ponzi scheme and escrow fraud websites. Second, a binary classifier is developed to examine search results for luxury goods and identify websites selling knock-offs. Finally, Strand is also used to detect various classes of malware by treating each malware file's binary content as a gene sequence, successfully detecting large volumes of malware files with a high level of accuracy and processing efficiency.
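The minhashing idea mentioned in the abstract can be illustrated with a minimal sketch. This is not Strand's implementation: the function names, the word length `k=5`, the 64 hash functions, and the use of salted `blake2b` as a stand-in hash family are all assumptions made for illustration. Each sequence is reduced to a short signature of minimum hash values over its words (k-mers), and the fraction of matching signature positions estimates the Jaccard similarity between two sequences' word sets.

```python
import hashlib


def kmers(seq, k):
    """Return the set of overlapping words (k-mers) of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(word, seed):
    # Assumption: salted blake2b stands in for a family of hash functions.
    digest = hashlib.blake2b(
        word.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(seq, k=5, num_hashes=64):
    """Lossy summary of seq: the minimum hash of its k-mers, per hash function."""
    words = kmers(seq, k)
    if not words:
        raise ValueError("sequence is shorter than the word length k")
    return [min(_hash(w, seed) for w in words) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching positions estimates the Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def classify(query, labeled_signatures, k=5, num_hashes=64):
    """Return the label whose stored signature is most similar to the query."""
    q = minhash_signature(query, k, num_hashes)
    return max(
        labeled_signatures,
        key=lambda label: estimate_jaccard(q, labeled_signatures[label]),
    )


# Hypothetical toy data: two "taxa", each with one labeled reference sequence.
ref_a = "ACGTACGTTAGCTAGCTAGGCTAACGTTAGC"
ref_b = "TTTTGGGGCCCCAAAATTTTGGGGCCCCAAAA"
labeled = {
    "taxon_a": minhash_signature(ref_a),
    "taxon_b": minhash_signature(ref_b),
}
query = ref_a[:-1] + "T"  # ref_a with its final base changed
predicted = classify(query, labeled)
```

Only the fixed-length signatures are stored for the labeled sequences, which is what keeps the training database small in this sketch. The same functions work on any string, so a file's bytes decoded to text could be passed in place of a gene sequence, in the spirit of the malware application described above.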

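Bibliographic record

  • Author

    Drew, Jake M.

  • Author affiliation

    Southern Methodist University

  • Degree-granting institution: Southern Methodist University
  • Discipline: Computer science; Bioinformatics
  • Degree: Ph.D.
  • Year: 2015
  • Pages: 205 p.
  • Total pages: 205
  • Original format: PDF
  • Language: English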
