Ensemble of deep learning, visual and acoustic features for music genre classification


In this work we present an ensemble for automated music genre classification that fuses acoustic and visual features, both handcrafted and non-handcrafted, extracted from audio files. These features are evaluated, compared, and fused in a final ensemble that is shown to produce better classification accuracy than other state-of-the-art approaches on the Latin Music Database, ISMIR 2004, and the GTZAN genre collection. To the best of our knowledge, this paper reports the largest comparison of combinations of different descriptors (including a wavelet convolutional scattering network, tested here for the first time as an input for texture descriptors) and different matrix representations. Superior performance is obtained without ad hoc parameter optimization; that is, the same ensemble of classifiers and the same parameter settings are used on all tested datasets. To demonstrate generalizability, our approach is also assessed on two further tasks, bird species recognition from vocalizations and whale detection. All MATLAB source code is available.
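
As a rough illustration of what fusing classifiers trained on different feature types can look like, the sketch below combines per-class scores from two hypothetical classifiers (one on acoustic features, one on visual features) at the score level. The abstract does not state which fusion rule is used, so the weighted sum rule, the min-shift normalization, the function names, and the toy scores are all assumptions made for illustration; the sketch is in Python and is not the authors' MATLAB code.

```python
import numpy as np

def normalize_scores(scores):
    """Rescale each row of an (n_samples, n_classes) score matrix to sum to 1.

    Assumed normalization: subtract the row minimum, then divide by the row sum.
    """
    scores = np.asarray(scores, dtype=float)
    shifted = scores - scores.min(axis=1, keepdims=True)
    totals = shifted.sum(axis=1, keepdims=True)
    # A row whose scores are all equal carries no information: use uniform scores.
    uniform = np.full_like(shifted, 1.0 / shifted.shape[1])
    safe_totals = np.where(totals > 0, totals, 1.0)
    return np.where(totals > 0, shifted / safe_totals, uniform)

def fuse_by_sum_rule(score_list, weights=None):
    """Fuse score matrices from several classifiers with a weighted sum rule.

    score_list: list of (n_samples, n_classes) arrays, one per classifier.
    weights:    optional per-classifier weights (equal weights if omitted).
    Returns the fused score matrix and the predicted class index per sample.
    """
    if weights is None:
        weights = np.ones(len(score_list))
    weights = np.asarray(weights, dtype=float)
    normalized = np.stack([normalize_scores(s) for s in score_list])
    fused = np.tensordot(weights, normalized, axes=1) / weights.sum()
    return fused, fused.argmax(axis=1)

# Toy scores for 2 audio clips and 3 genres (made-up numbers).
acoustic_scores = [[2.0, 1.0, 0.0], [0.2, 0.5, 0.3]]
visual_scores = [[0.6, 0.3, 0.1], [0.1, 0.2, 0.7]]

fused, labels = fuse_by_sum_rule([acoustic_scores, visual_scores])
print(labels)
```

In this toy case the two classifiers disagree on the second clip, and the fused scores favor the class that the visual classifier supports more strongly. Normalizing before summing keeps a classifier with a larger raw score range from dominating the ensemble.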

Keywords: audio classification, texture, image processing, acoustic features, ensemble of classifiers, machine learning.
