
Visual management of large scale data mining projects


Abstract

This paper describes a unified framework for visualizing the preparations for, and results of, hundreds of machine learning experiments. These experiments were designed to improve the accuracy of enzyme functional predictions from sequence, and in many cases were successful. Our system provides graphical user interfaces for defining and exploring training datasets and various representational alternatives, for inspecting the hypotheses induced by various types of learning algorithms, for visualizing the global results, and for inspecting in detail results for specific training sets (functions) and examples (proteins). The visualization tools serve as a navigational aid through a large amount of sequence data and induced knowledge. They provided significant help in understanding both the significance and the underlying biological explanations of our successes and failures. Using these visualizations it was possible to efficiently identify weaknesses of the modular sequence representations and induction algorithms which suggest better learning strategies. The context in which our data mining visualization toolkit was developed was the problem of accurately predicting enzyme function from protein sequence data. Previous work demonstrated that approximately 6% of enzyme protein sequences are likely to be assigned incorrect functions on the basis of sequence similarity alone. In order to test the hypothesis that more detailed sequence analysis using machine learning techniques and modular domain representations could address many of these failures, we designed a series of more than 250 experiments using information-theoretic decision tree induction and naive Bayesian learning on local sequence domain representations of problematic enzyme function classes. In more than half of these cases, our methods were able to perfectly discriminate among various possible functions of similar sequences. We developed and tested our visualization techniques on this application.
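
The abstract names information-theoretic decision tree induction as one of the two learning methods. As a minimal sketch only, the information-gain criterion commonly used by such algorithms could be computed as below. The paper's actual implementation and feature encoding are not given here, so the function names, the boolean "domain present" feature, and the toy labels are all hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a non-empty list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(feature_values, labels):
    """Reduction in label entropy after partitioning examples by a feature."""
    total = len(labels)
    partitions = {}
    for value, label in zip(feature_values, labels):
        partitions.setdefault(value, []).append(label)
    # Weighted average entropy of the partitions produced by the split.
    remainder = sum(
        len(part) / total * entropy(part) for part in partitions.values()
    )
    return entropy(labels) - remainder


# Hypothetical toy data: whether a sequence domain is present in each protein,
# and the (made-up) enzyme function class assigned to that protein.
has_domain = [True, True, False, False]
function_class = ["A", "A", "B", "B"]

gain = information_gain(has_domain, function_class)
print(f"information gain: {gain:.3f} bits")
```

A decision tree learner of this kind would typically evaluate every candidate feature this way and split on the one with the highest gain. A feature that separates the classes perfectly, as in the toy data, recovers the full entropy of the labels.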

Bibliographic details

  • Journal: other
  • Authors: I. Shah; L. Hunter
  • Pages: 278–290
  • Total pages: 13
  • Original format: PDF
  • Date indexed: 2022-08-21 11:34:15
