The automatic recognition of violent or aggressive human activities is an important problem because it can improve public safety. Violent acts are characterized by several features that distinguish them from normal behaviours. The most relevant are audio and visual features, which are commonly used in multi-sensor architectures together with data fusion techniques. In this paper we address a particular kind of visual cue extracted from monocular colour video streams, namely the spatial-temporal behaviour of coloured stains, and we show the importance of this cue for the recognition of violent activities. Unlike previous approaches, our system assumes little knowledge about the acquisition setup and about the content of the acquired scenes. Because we use low-level features together with some warping and motion parameters, it is not necessary to extract accurate target silhouettes, a critical task because of the occlusions and overcrowding that are typical of interpersonal contact. A new index, called Maximum Warping Energy (MWE), is defined to describe the localized spatial-temporal complexity of colour conformations. Our experiments show that aggressive activities give significantly higher MWE values than safe actions such as walking, running, embracing or handshaking. It is therefore possible to distinguish violent acts from normal behaviours even in the presence of many people and in crowded environments.