Vectorization Generalizations in Genomics and Transportation.


Abstract

The process of transforming a sample into a pair of input and output vectors is sometimes referred to as "vectorization". Those samples and their respective vectorizations are used within various learning algorithms to create a model that predicts unknown output vectors given known input vectors. Finding a good combination of vectorization and algorithm accounts for much of the work in various statistical learning applications. This thesis aims to compare, generalize, and improve existing vectorizations within the fields of bioinformatics and transportation.

Many methods have been proposed for the phylogenetic classification of viruses. Performing these classifications in a timely manner is of interest to researchers and to those ensuring national security. While multiple sequence alignment remains the tool of choice for practitioners for reasons of interpretability, alignment-free methods have gained popularity due to the substantial increases in speed they provide.

We first extend the natural vector description of genomes to handle viruses and various issues unique to viral genomes. We provide an alternative definition of the natural vector that is able to handle ambiguous nucleotides. We provide a bound on the distance induced by the natural vector between a genome and a mutation of that genome due to a single-nucleotide polymorphism (SNP).

Applying these methods, we test the ability of the natural vector to accurately classify viruses using the National Center for Biotechnology Information's (NCBI) collection of 2044 virus reference sequences (RefSeq), which covers the range of known viruses derived from all 7 Baltimore classes, 73 families, and 253 genera. We then compare these classification results to the predominant method of measuring genome similarity, multiple sequence alignment (MSA).

We then present a new family of alignment-free vectorizations of the genome that maintains the speed of existing alignment-free methods and incorporates the interpretability of sequence alignment. This new alignment-free vectorization uses the frequency of genomic words (k-mers), as is done in the composition vector, and incorporates descriptive statistics of those k-mers' positional information, as inspired by the natural vector.

For the first time, we provide a thorough comparison of 5 popular characterizations of genome similarity using k-nearest neighbor classification, and evaluate these on two collections of viruses. The first is the NCBI RefSeq collection above. This informs us of the quality of the various vectorizations' high-level classifications, i.e., Baltimore class, family, and genus. The second collection comes from the online PAirwise Sequence Classification (PASC) tool and consists of 53 families/genera of curated viruses for a total of 9545 viruses. This collection informs us of the quality of the various vectorizations' low-level classifications, i.e., species. From these classification results we make recommendations for reclassification of some viruses.

The prediction of bus arrival times is important for users of public transportation. This problem has received some attention, with various authors proposing different vectorizations and different representations of the problem. For example, some propose different models for different times of the day, while others suggest using the same model throughout the day, with the posted schedule as a parameter within the model. We first generalize the vectorizations and representations existing in the literature. We then propose a method of recovering the schedule and show, using 3 weeks of Chicago Transit Authority (CTA) bus data, that the use of this schedule uniformly improves all existing methods.

Lastly, we analyze data usage from reporting real-time GPS traces. The problem of tracking a GPS device relies upon predicting vehicle location in general, as opposed to predicting vehicle location on fixed routes as above. We propose an online method that uses historical location data. We compare this method of location prediction to commonly used methods of location prediction using a metric based on the efficiency of mobile data usage. We compare 12 different tracking methods on two data sets: the first from Microsoft Research (MSR) and the second from the UIC shuttle. We show that at low error tolerances the methods are equivalent, but at higher error tolerances the proposed method is substantially more efficient.
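
The k-mer vectorization described in the abstract, which combines k-mer frequencies with descriptive statistics of k-mer positions, can be illustrated roughly as follows. This is a minimal sketch and not the thesis's definition: the function names `kmer_position_vector` and `nearest_label` are hypothetical, the choice of mean and variance of start positions as the descriptive statistics is an assumption, and windows containing ambiguous nucleotides are simply skipped rather than handled as the thesis proposes. The 1-nearest-neighbor helper is the simplest case of k-nearest neighbor classification and applies no feature scaling.

```python
import math
from collections import defaultdict
from itertools import product


def kmer_position_vector(seq, k=2, alphabet="ACGT"):
    """Vectorize a sequence as [count, mean position, position variance]
    for every k-mer over `alphabet`, in lexicographic order.

    Assumption: mean and variance of 1-based start positions stand in for
    the "descriptive statistics" mentioned in the abstract.
    """
    seq = seq.upper()
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i + 1)  # 1-based start position

    vector = []
    for word in map("".join, product(alphabet, repeat=k)):
        # Windows containing symbols outside `alphabet` (e.g. N) never
        # match a word here, so they are ignored in this sketch.
        pos = positions.get(word, [])
        count = len(pos)
        if count == 0:
            vector.extend([0.0, 0.0, 0.0])
            continue
        mean = sum(pos) / count
        variance = sum((p - mean) ** 2 for p in pos) / count
        vector.extend([float(count), mean, variance])
    return vector


def nearest_label(query, labeled_vectors):
    """Return the label of the labeled vector closest to `query`
    in Euclidean distance (1-nearest-neighbor)."""
    return min(labeled_vectors, key=lambda item: math.dist(query, item[0]))[1]
```

With `k=1` the vector holds one count, mean position, and variance per nucleotide, which is similar in spirit to the natural vector, although the normalization used there may differ. In practice the count and position components live on very different scales, so some rescaling would likely be needed before distances are meaningful.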

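Bibliographic record

  • Author

    Hernandez, Troy A.

  • Author affiliation

    University of Illinois at Chicago

  • Degree-granting institution: University of Illinois at Chicago
  • Subject: Statistics; Biology, Bioinformatics; Computer Science
  • Degree: Ph.D.
  • Year: 2013
  • Pagination: 136 p.
  • Total pages: 136
  • Original format: PDF
  • Language of text: English
  • Chinese Library Classification: Remote sensing technology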