Conference: International Conference on Modeling and Analysis of Semiconductor Manufacturing

DATA MINING, STRIP MINING AND OTHER HISTORICAL DATA HIGH JINX


Abstract

When one decides to embark on a data mining project, there are two key tasks that must be completed at the very beginning: clearly defining the goals and expectations of the project, and preparing the data properly before any data mining or modeling is performed. When data mining with historical data sets, one needs to understand several aspects of the data: variable data types, data structures, the existence of potential outliers, the equipment used at each operation, relationships, interactions and correlations between categorical and continuous variables, relationships between predictor and response variables, effects over time, basic assumptions about the distributions of the variables, and data integrity. Using SAS Institute's JMP statistical analysis software package, several solutions will be proposed to address these data issues. The following techniques are presented: making multiple scatterplots to highlight potential outliers, constructing frequency tables to highlight missing cells and small sample sizes, using date variables to compare tools running simultaneously, changing color and symbol type to add dimensionality to the data, concatenating categorical variables to look for interactions, constructing histograms and probability plots to check data distributions, and using summary sample size tables to check data integrity. These techniques will enable the analyst to make sound, realistic and statistically correct decisions when data mining with large historical data sets.
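Two of the techniques listed in the abstract, frequency tables that expose missing cells and small sample sizes, and concatenating categorical variables to look for interactions, can be illustrated with a minimal sketch. The paper itself works in JMP; the Python below is only a stand-in, and the record layout, the tool and recipe names, and the `min_n` threshold are all hypothetical assumptions rather than anything taken from the paper.

```python
from collections import Counter
from itertools import product

# Hypothetical lot records: (tool, recipe, measured response).
# The response value is carried along but not used in this sketch.
records = [
    ("ETCH_A", "R1", 10.1),
    ("ETCH_A", "R1", 10.3),
    ("ETCH_A", "R1", 9.9),
    ("ETCH_A", "R2", 10.0),
    ("ETCH_B", "R1", 10.2),
    ("ETCH_B", "R1", 10.4),
    ("ETCH_B", "R1", 10.0),
]


def frequency_table(rows):
    """Count rows for every tool x recipe cell, including empty cells."""
    counts = Counter((tool, recipe) for tool, recipe, _ in rows)
    tools = sorted({tool for tool, _, _ in rows})
    recipes = sorted({recipe for _, recipe, _ in rows})
    # Enumerating the full cross product makes missing cells show up as 0.
    return {cell: counts.get(cell, 0) for cell in product(tools, recipes)}


def flag_sparse_cells(table, min_n=3):
    """Return cells that are missing (n == 0) or have fewer than min_n rows."""
    return {cell: n for cell, n in table.items() if n < min_n}


def concatenate_levels(rows, sep="|"):
    """Join two categorical variables into one combined label per row."""
    return [f"{tool}{sep}{recipe}" for tool, recipe, _ in rows]


table = frequency_table(records)
sparse = flag_sparse_cells(table, min_n=3)
combined = concatenate_levels(records)

for cell, n in sorted(table.items()):
    print(cell, n)
print("sparse cells:", sparse)
```

In this made-up data the `ETCH_B`/`R2` cell is empty and the `ETCH_A`/`R2` cell holds a single row, so any comparison of recipes across tools would rest on almost no data. The combined labels from `concatenate_levels` could then serve as a single grouping variable when plotting or summarizing the response.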
