Audio tagging aims to perform multi-label classification on audio chunks and is a newly proposed task in the Detection and Classification of Acoustic Scenes and Events 2016 (DCASE 2016) challenge. This task encourages research efforts to better analyze and understand the content of the huge amounts of audio data on the web. The difficulty in audio tagging is that only chunk-level labels are available, without frame-level labels. This paper presents a weakly supervised method that not only predicts the tags but also indicates the temporal locations of the acoustic events that occur. The attention scheme is found to be effective in identifying the important frames while ignoring the unrelated frames. The proposed framework is a deep convolutional recurrent model with two auxiliary modules: an attention module and a localization module. The proposed algorithm was evaluated on Task 4 of the DCASE 2016 challenge. State-of-the-art performance was achieved on the evaluation set, with the equal error rate (EER) reduced from 0.13 to 0.11 compared with the convolutional recurrent baseline system.
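To make the described architecture concrete, the following is a minimal PyTorch sketch of a convolutional recurrent network with frame-level attention and localization heads, where chunk-level tag predictions are obtained as an attention-weighted average over frames so that training needs only chunk-level labels. All module sizes and names (e.g., AttentionCRNN, hidden width, kernel sizes) are illustrative assumptions, not the paper's exact configuration.

```python
# Illustrative sketch, not the authors' exact model: a CRNN whose
# localization head emits per-frame tag probabilities and whose attention
# head weights frames before pooling to a chunk-level prediction.
import torch
import torch.nn as nn

class AttentionCRNN(nn.Module):
    def __init__(self, n_mels=64, n_tags=8, hidden=128):
        super().__init__()
        # Convolutional front end over (batch, 1, time, mel) spectrograms.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d((1, 2)),  # pool frequency only; keep time resolution
        )
        self.gru = nn.GRU(32 * (n_mels // 2), hidden,
                          batch_first=True, bidirectional=True)
        # Localization module: per-frame tag probabilities.
        self.frame_classifier = nn.Linear(2 * hidden, n_tags)
        # Attention module: per-frame importance weights over time.
        self.attention = nn.Linear(2 * hidden, n_tags)

    def forward(self, x):  # x: (batch, time, n_mels)
        h = self.conv(x.unsqueeze(1))         # (batch, 32, time, n_mels//2)
        h = h.permute(0, 2, 1, 3).flatten(2)  # (batch, time, features)
        h, _ = self.gru(h)                    # (batch, time, 2*hidden)
        frame_probs = torch.sigmoid(self.frame_classifier(h))
        att = torch.softmax(self.attention(h), dim=1)  # normalize over time
        # Chunk-level prediction: attention-weighted average of frame
        # probabilities, trainable from chunk-level labels alone.
        chunk_probs = (frame_probs * att).sum(dim=1)   # (batch, n_tags)
        return chunk_probs, frame_probs
```

Under this reading, chunk_probs would be trained with binary cross-entropy against the chunk-level tags, while frame_probs provides the temporal localization of events as a by-product of the weak supervision.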