Scalable machine learning using applications in bioinformatics and cybercrime.

Abstract

This thesis contributes multiple scalable machine learning applications in the fields of bioinformatics and cybercrime. It also presents a highly parallel framework for machine learning, called the Collaborative Analytics Framework, which leverages shared memory to process large datasets efficiently. In bioinformatics, applications for gene sequence classification are implemented. In the gene sequence classification problem, unlabeled gene sequences are matched to sequences labeled with known taxonomies. Existing alignment-based methods are inefficient in practice and must balance performance by using shorter word lengths. Prior alignment-free methods do not scale efficiently as the number of training sequences grows. A new alignment-free method, called STRAND, is introduced. STRAND achieves accuracy as good as or better than existing alignment-free methods, at improved speed and with a reduced in-memory training database footprint. It does so by exploiting a form of lossy compression called minhashing as part of an in-memory MapReduce-style framework. STRAND is also applied to shotgun classification challenges for the purpose of abundance estimation. Scalable machine learning applications are then applied to multiple cybercrime datasets. First, a method is presented to cluster criminal websites that are loose copies of one another. This general method is then applied to two specific cases, detecting thousands of copied Ponzi scheme and escrow fraud websites. Second, a binary classifier is developed to examine search results for luxury goods and identify websites selling knock-offs. Finally, STRAND is used to detect various classes of malware by treating each malware file's binary content as a gene sequence, successfully detecting large volumes of malware files with a high level of accuracy and processing efficiency.
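The minhashing idea mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the thesis's STRAND implementation: the function names (`kmers`, `minhash_signature`, `estimated_jaccard`, `classify`), the word length `k`, the number of hash functions, and the use of salted BLAKE2b hashes are all assumptions chosen for illustration. The sketch splits each sequence into fixed-length words, keeps only the minimum hash value per hash function as a compact (lossy) signature, and assigns a query the label of the training sequence whose signature agrees with it most often.

```python
import hashlib


def kmers(sequence, k):
    """Return the set of all length-k words (substrings) of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def minhash_signature(words, num_hashes=64):
    """Lossy summary of a word set: the minimum hash value under each of
    num_hashes salted hash functions. Assumes `words` is non-empty."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # one salt per hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(w.encode(), digest_size=8, salt=salt).digest(),
                "big",
            )
            for w in words
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions; this estimates the
    Jaccard similarity of the two underlying word sets."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def classify(query, labeled_sequences, k=4, num_hashes=64):
    """Return the label of the training sequence most similar to the query.
    `labeled_sequences` is a list of (label, sequence) pairs (hypothetical format)."""
    query_sig = minhash_signature(kmers(query, k), num_hashes)
    best_label, _ = max(
        labeled_sequences,
        key=lambda item: estimated_jaccard(
            query_sig, minhash_signature(kmers(item[1], k), num_hashes)
        ),
    )
    return best_label


training = [
    ("taxon_A", "ACGTACGTTAGCCGATAGGCTA"),
    ("taxon_B", "TTTTGGGGCCCCAAAATTTTGGGG"),
]
predicted = classify("ACGTACGTTAGCCGATAGGCTA", training)
```

In this toy setup the signatures are recomputed for every query. A scalable system would instead compute each training signature once and store only those signatures in memory, which is where the reduced training footprint described in the abstract would come from.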


Bibliographic record

Author: Drew, Jake M.
Author affiliation: Southern Methodist University
Degree-granting institution: Southern Methodist University
Subject: Computer science; Bioinformatics
Degree: Ph.D.
Year: 2015
Pages: 205
Original format: PDF
Language: English

