HDSVM: A High Efficiency Distributed SVM Framework over Data Stream

Abstract

The application of Support Vector Machines (SVM) to data streams is growing with the increasing demand for real-time processing in classification tasks such as anomaly detection and real-time image processing. However, the dynamic, high-volume, fast-arriving data in streams makes it challenging to apply SVM in data stream processing. Existing SVM implementations are mostly designed for batch processing and, because of their inherent complexity, can hardly satisfy the efficiency requirements of stream processing. To address these challenges, we propose a high efficiency distributed SVM framework over data stream (HDSVM), which consists of two main algorithms: an incremental learning algorithm and a distributed algorithm. First, we propose a partial support vectors reserving incremental learning algorithm (PSVIL). By updating the SVM with a subset of support vectors, selected by their distances to the classification hyperplane, instead of the full set, the algorithm achieves lower time overhead while preserving accuracy. Second, we propose a distribution remaining partition and fast aggregation distributed algorithm (DRPFA) for SVM. The real-time data is partitioned by clustering so that each partition preserves the original distribution, rather than by random partition, and historical support vectors are partitioned based on their distances to the classification hyperplane. Because of this partition strategy, the global hyperplane can be obtained by averaging the parameters of the local hyperplanes. Extensive experiments on Apache Storm show that the proposed HDSVM achieves lower time overhead and similar accuracy compared with the state of the art. The speed-up ratio increases by a factor of 2 to 8, with accuracy deviation within 1%.
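The two ideas named in the abstract, distance-based selection of support vectors and averaging of local hyperplane parameters, can be illustrated with a minimal Python sketch for a linear SVM. This is not the authors' implementation. The function names, the `max_distance` threshold, and the rule of keeping vectors at or below that distance are all assumptions made for illustration; the abstract does not state the actual selection criterion or how local models are trained.

```python
import math


def distance_to_hyperplane(w, b, x):
    """Geometric distance |w.x + b| / ||w|| from point x to the hyperplane (w, b)."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def select_support_vectors(w, b, support_vectors, max_distance):
    """Keep the support vectors whose distance to the hyperplane is at most max_distance.

    Assumption: a simple threshold rule stands in for the selection step;
    the abstract only says the subset is chosen by distance.
    """
    return [sv for sv in support_vectors
            if distance_to_hyperplane(w, b, sv) <= max_distance]


def average_hyperplanes(local_models):
    """Average the (w, b) pairs of local models into one global hyperplane."""
    n = len(local_models)
    dim = len(local_models[0][0])
    w = [sum(model[0][i] for model in local_models) / n for i in range(dim)]
    b = sum(model[1] for model in local_models) / n
    return w, b


if __name__ == "__main__":
    w, b = [3.0, 4.0], 0.0
    historical = [[1.0, 1.0], [5.0, 5.0]]
    # Retain only the vectors close to the current hyperplane.
    print(select_support_vectors(w, b, historical, max_distance=1.5))
    # Combine two hypothetical local models trained on different partitions.
    print(average_hyperplanes([([1.0, 2.0], 0.5), ([3.0, 4.0], 1.5)]))
```

Plain averaging is only meaningful when the local models are trained on partitions with similar distributions, which is the property the abstract attributes to the clustering-based partition strategy.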