A machine learning framework is presented that supports data mining and statistical modeling of systems that are monitored by large-scale sensor networks. The proposed algorithm is novel in that it takes both observations and domain knowledge into consideration and provides a mechanism that combines analytical modeling and inductive learning. An efficient solver is presented that allow the algorithm to solve large-scale problems efficiently. The solver uses a randomized kernel that incorporates domain knowledge into support vector machine learning. It also takes advantage of the sparseness of support vectors and this allows for parallelization and online training to further speed-up of the computation. The solver can be integrated into existing systems, embedded into databases, or exposed as a web service. Understanding the data generated by large-scale system presents several problems. First, statistical modeling approaches may either under-fit or over-fit the data and are sensitive to data quality. Second, learning is a computational extensive process and often becomes intractable when the sample size exceeds several thousands.
展开▼