
Feature selection for robust knowledge discovery from data.

Abstract

Modern information systems provide the ability to record, store, retrieve, and transmit massive amounts of data, and such systems have become a routine part of daily life for many people. While computers are adept at handling information, it has long been the goal of the knowledge discovery from data (KDD) community to enable computers to extract meaningful knowledge from this information: knowledge that would otherwise be lost to humans in the sheer volume of data. Feature selection techniques are often used in this discovery process to help combat the "curse of dimensionality," that is, the tremendous sample requirements that arise with high-dimensional data. Traditional feature selection algorithms consider only data sets with many available samples, where the distribution of samples is assumed to accurately represent the entire population (an unbiased sample set). There are many situations where these assumptions do not hold, causing existing feature selection techniques to break down and, in turn, preventing automated KDD processes from being used. Each of these situations presents its own unique challenges to feature selection. In this dissertation, I analyze these challenges and develop feature selection algorithms that perform robustly, specifically on small sample-size problems and in domains where biased data is unavoidable. In small sample-size data, results show that traditional feature selection algorithms produce unstable selected subsets; that is, subset membership changes with perturbations to the sample set. This reduces confidence that the selected subsets are truly relevant to the learning target and casts doubt on any extracted knowledge. Using traditional feature selection techniques on biased data leads to heavily biased models that are not informative with respect to the entire problem. In dynamic situations where biased data is encountered, such as in reinforcement learning, this can prevent learning from proceeding at all, breaking the KDD process. In this dissertation, I present several techniques for overcoming these issues and demonstrate their effectiveness on diverse applications. Furthermore, I show that the methods developed here can significantly outperform state-of-the-art feature selection algorithms in each application.
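The notion of subset instability mentioned in the abstract can be made concrete with a small experiment. The sketch below is not the dissertation's method: it assumes a simple correlation-based filter (top-k features by absolute Pearson correlation), bootstrap resampling as the perturbation, and mean pairwise Jaccard similarity as the stability score. All function names and the synthetic data generator are hypothetical and chosen only for illustration.

```python
import random
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def select_top_k(rows, targets, k):
    """Assumed filter: indices of the k features most correlated with the target."""
    n_features = len(rows[0])
    scores = [abs(pearson([r[j] for r in rows], targets)) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return frozenset(ranked[:k])


def jaccard(a, b):
    """Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)


def subset_stability(rows, targets, k, n_resamples=20, seed=0):
    """Mean pairwise Jaccard similarity of subsets selected on bootstrap resamples."""
    rng = random.Random(seed)
    n = len(rows)
    subsets = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        subsets.append(select_top_k([rows[i] for i in idx], [targets[i] for i in idx], k))
    pairs = list(combinations(subsets, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def make_data(n_samples, n_features, n_relevant, seed=0):
    """Synthetic data: the target is the sum of the first n_relevant features plus noise."""
    rng = random.Random(seed)
    rows = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n_samples)]
    targets = [sum(r[:n_relevant]) + rng.gauss(0, 1) for r in rows]
    return rows, targets


if __name__ == "__main__":
    for n_samples in (10, 200):
        rows, targets = make_data(n_samples, n_features=20, n_relevant=3)
        score = subset_stability(rows, targets, k=3)
        print(f"{n_samples} samples: stability = {score:.2f}")
```

A score near 1 means the same features are chosen regardless of which samples are drawn; a score near 0 means the selected subset depends heavily on the particular sample set, which is the small-sample behavior the abstract describes.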

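Bibliographic record

  • Author

    Loscalzo, Steven

  • Author affiliation

    State University of New York at Binghamton

  • Degree-granting institution State University of New York at Binghamton
  • Discipline Computer Science
  • Degree Ph.D.
  • Year 2012
  • Pages 186 p.
  • Total pages 186
  • Original format PDF
  • Language of text English
  • Chinese Library Classification Aquaculture, fisheries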
