Data mining in tree-based models and large-scale contingency tables.

Abstract

This thesis is composed of two parts. The first part pertains to tree-based models. The second part deals with multiple testing in large-scale contingency tables. Tree-based models have gained enormous popularity in statistical modeling and data mining. We propose a novel tree-pruning algorithm called the frontier-based tree-pruning algorithm (FBP). The new method has an order of computational complexity comparable to that of cost-complexity pruning (CCP), and it provides a full spectrum of information about tree pruning. A numerical study on real data sets reveals a surprising result: under the complexity-penalization approach, most tree sizes are inadmissible. FBP also facilitates a more faithful implementation of cross-validation, an approach that our simulations favor.

One of the most common test procedures using two-way contingency tables is the test of independence between two categorizations. Current test procedures such as the chi-square or likelihood ratio tests assess overall independence but give limited information about the nature of the association in a contingency table. We propose an approach for testing the independence of categories in individual cells of contingency tables, based on a multiple testing framework. We then employ the proposed method to identify patterns of pairwise association between amino acids involved in beta-sheet bridges of proteins. We identify a number of amino acid pairs that exhibit either strong or weak association. These patterns provide useful information for algorithms that predict secondary and tertiary structures of proteins.
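The abstract does not specify the cell-level test statistic or the multiple-testing correction used in the thesis. As a rough illustration of the general idea only, the sketch below assumes adjusted standardized residuals with a normal approximation for each cell and the Benjamini-Hochberg procedure to control the false discovery rate. The function names `cell_tests` and `benjamini_hochberg` and the example table are hypothetical and are not taken from the thesis.

```python
import math


def cell_tests(table):
    """Per-cell test of independence (assumed method, not the thesis's).

    Returns a list of ((row, col), z, p) where z is the adjusted
    standardized residual and p is a two-sided normal p-value.
    Assumes no single row or column holds all observations.
    """
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    results = []
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            variance = expected * (1 - row_tot[i] / n) * (1 - col_tot[j] / n)
            z = (observed - expected) / math.sqrt(variance)
            p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
            results.append(((i, j), z, p))
    return results


def benjamini_hochberg(pvalues, q=0.05):
    """Return booleans marking hypotheses rejected at FDR level q."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda k: pvalues[k])
    cutoff = 0
    for rank, k in enumerate(order, start=1):
        if pvalues[k] <= q * rank / m:
            cutoff = rank  # largest rank passing the step-up threshold
    rejected = [False] * m
    for k in order[:cutoff]:
        rejected[k] = True
    return rejected


# Hypothetical 3x3 table of pair counts, for illustration only.
example = [[30, 10, 10], [10, 30, 10], [10, 10, 30]]
tests = cell_tests(example)
flags = benjamini_hochberg([p for _, _, p in tests])
for (cell, z, p), flag in zip(tests, flags):
    print(cell, round(z, 2), "significant" if flag else "not significant")
```

In this sketch a positive residual marks a cell that occurs more often than independence predicts and a negative residual marks one that occurs less often, which is one plausible way to read the "strong or weak association" language of the abstract.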

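Bibliographic record

  • Author: Kim, Seoung Bum
  • Author affiliation: Georgia Institute of Technology
  • Degree-granting institution: Georgia Institute of Technology
  • Discipline: Engineering, Industrial
  • Degree: Ph.D.
  • Year: 2005
  • Pages: 160
  • Original format: PDF
  • Language of text: English
  • Chinese Library Classification: General industrial technology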