Systems and method identify potentially mislabeled file samples. A graph is created from a plurality of sample files. The graph includes nodes associated with the sample files and behavior nodes associated with behavior signatures. Phantom nodes are created in the graph for those sample files having a known label. During a label propagation operation, a node receives data indicating a label distribution of a neighbor node in the graph. In response to determining that the current label for the node is known, a neighborhood opinion is determined for the associated phantom node, based at least in part on the label distribution of the neighboring nodes. After the label propagation operation has completed, differences between the neighborhood opinion and the current label distribution for nodes are determined. If the difference exceeds a threshold, then the current label may be incorrect.
展开▼