Unknown malcode detection using classifiers with optimal training sets

Abstract

The present invention is directed to a method for detecting unknown malicious code, such as a virus, a worm, a Trojan horse, or any combination thereof. A data set is created: a collection of files comprising a first subset of malicious code files and a second subset of benign code files, with the malicious and benign files identified by an antivirus program. All files are parsed using n-gram moving windows of several lengths, and the term frequency (TF) representation is computed for each n-gram in each file. An initial set of top features (e.g., up to 5500) is selected from all n-grams based on the document frequency (DF) measure, and feature selection methods then reduce the number of top features to fit the computational resources available for classifier training. The optimal number of features is determined by evaluating the detection accuracy of several sets of reduced top features. Based on that optimal number, different data sets with different distributions of benign and malicious files are prepared for use as training and test sets. For each classifier, the detection accuracy is evaluated iteratively over all combinations of training and test set distributions: in each iteration, the classifier is trained on one specific distribution and then tested on all distributions. The distribution that yields the highest detection accuracy is selected as optimal for that classifier.
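
The n-gram parsing, TF computation, and DF-based feature selection described in the abstract could be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function names (`ngram_tf`, `top_by_df`, `vectorize`) are hypothetical, files are treated as raw byte strings, and normalizing TF by the most frequent n-gram in the file is an assumption, since the abstract does not specify the normalization. A single window length `n` is used per call; the method as described would repeat this for several lengths.

```python
from collections import Counter


def ngram_tf(data: bytes, n: int) -> dict:
    """Term frequency of each byte n-gram in one file.

    Assumption: counts are divided by the count of the most frequent
    n-gram in the same file, so values fall in (0, 1].
    """
    counts = Counter(data[i:i + n] for i in range(len(data) - n + 1))
    if not counts:
        return {}
    peak = max(counts.values())
    return {gram: c / peak for gram, c in counts.items()}


def top_by_df(files, n: int, k: int) -> list:
    """Return the k n-grams that appear in the largest number of files."""
    df = Counter()
    for data in files:
        # A set counts each n-gram at most once per file (document frequency).
        df.update({data[i:i + n] for i in range(len(data) - n + 1)})
    # Sort by DF descending; ties are broken by byte order for determinism.
    ranked = sorted(df.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:k]]


def vectorize(data: bytes, n: int, vocab: list) -> list:
    """Map one file to a TF vector over the selected n-gram vocabulary."""
    tf = ngram_tf(data, n)
    return [tf.get(gram, 0.0) for gram in vocab]
```

In this sketch, `k` plays the role of the initial feature budget (the abstract mentions up to 5500); a further feature selection step would then shrink the vocabulary before training.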
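
The final step of the abstract, training on each malicious/benign distribution and testing on all distributions, could look something like the sketch below. Everything here is an illustrative assumption: the helper names are hypothetical, a toy nearest-centroid classifier stands in for whichever classifiers are actually evaluated, the pools are split in half so training and test files do not overlap, and training distributions are ranked by mean accuracy across all test distributions, which is one possible reading of "highest detection accuracy".

```python
import random


def make_set(malicious, benign, mal_fraction, size, rng):
    """Sample `size` labeled vectors with the requested malicious share.

    Assumes the fraction leaves at least one file in each class.
    """
    n_mal = round(size * mal_fraction)
    picked = [(x, 1) for x in rng.sample(malicious, n_mal)]
    picked += [(x, 0) for x in rng.sample(benign, size - n_mal)]
    return picked


def train_centroids(samples):
    """Toy stand-in classifier: one mean feature vector per class."""
    centroids = {}
    for label in (0, 1):
        rows = [x for x, y in samples if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict(centroids, x):
    """Return the label whose centroid is closest (squared Euclidean)."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: dist(centroids[label]))


def accuracy(centroids, samples):
    return sum(predict(centroids, x) == y for x, y in samples) / len(samples)


def best_training_distribution(malicious, benign, fractions, size, seed=0):
    """Train on each distribution, test on all, return (best, scores)."""
    rng = random.Random(seed)
    mal, ben = malicious[:], benign[:]
    rng.shuffle(mal)
    rng.shuffle(ben)
    # First half of each pool feeds training sets, second half test sets.
    mal_tr, mal_te = mal[:len(mal) // 2], mal[len(mal) // 2:]
    ben_tr, ben_te = ben[:len(ben) // 2], ben[len(ben) // 2:]
    train_sets = {f: make_set(mal_tr, ben_tr, f, size, rng) for f in fractions}
    test_sets = {f: make_set(mal_te, ben_te, f, size, rng) for f in fractions}
    scores = {}
    for f_train, train in train_sets.items():
        model = train_centroids(train)
        # Assumption: rank by mean accuracy over every test distribution.
        per_test = [accuracy(model, test) for test in test_sets.values()]
        scores[f_train] = sum(per_test) / len(per_test)
    return max(scores, key=scores.get), scores
```

Calling `best_training_distribution(mal_vectors, ben_vectors, [0.1, 0.3, 0.5], size=10)` returns the training malicious-file fraction with the highest mean accuracy together with the score for each candidate; `mal_vectors` and `ben_vectors` are placeholder names for lists of TF feature vectors.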

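Bibliographic data

  • Publication number: EP2128798A1

  • Patent type:

  • Publication date: 2009-12-02

  • Original format: PDF

  • Applicant/Assignee: DEUTSCHE TELEKOM AG

  • Application number: EP20090007053

  • Inventors: MOSKOVITCH ROBERT; ELOVICI YUVAL

  • Filing date: 2009-05-27

  • Classification: G06K9/62; G06F21

  • Country: EP

  • Date indexed: 2022-08-21 18:36:59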
