Abstract:
Acoustic scene classification is one of the most challenging tasks in machine hearing. A basic convolutional neural network (CNN) trained on a single feature type rarely achieves good classification performance. To address this problem, this paper proposes an acoustic scene classification scheme based on time-frequency feature fusion and a multi-resolution convolutional neural network. For the model, a multi-resolution pooling scheme is used to build a multi-resolution CNN that adapts better to the time-frequency structure of the extracted features. For feature extraction, the log Mel-band energies, which serve as low-level envelope features, and the non-negative matrix factorization (NMF) coefficient matrix, which serves as a high-level structural feature, are fused into a three-dimensional feature that is fed to the classification model. Training and testing are carried out on the development datasets of the 2017 and 2018 Detection and Classification of Acoustic Scenes and Events (DCASE) challenges. The experimental results show that the proposed scheme improves classification accuracy over the baseline system by 7.5% and 10.3%, respectively.
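The feature fusion step can be illustrated with a minimal NumPy sketch. It is not the implementation used in this paper: the sampling rate, frame sizes, number of Mel bands, the multiplicative-update NMF, and all function names (`stft_magnitude`, `mel_filterbank`, `nmf_activations`, `fuse_features`) are assumptions chosen for illustration. In particular, the NMF rank is set equal to the number of Mel bands only so that the two feature maps have the same shape and can be stacked as channels.

```python
import numpy as np

def stft_magnitude(signal, n_fft=512, hop=256):
    """Magnitude spectrogram of shape (n_fft // 2 + 1, n_frames)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack(
        [signal[i * hop:i * hop + n_fft] * window for i in range(n_frames)],
        axis=1,
    )
    return np.abs(np.fft.rfft(frames, axis=0))

def mel_filterbank(sr, n_fft, n_mels):
    """Triangular Mel filters of shape (n_mels, n_fft // 2 + 1)."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    hz_points = mel_to_hz(
        np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    )
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(n_mels):
        lo, center, hi = hz_points[m:m + 3]
        rising = (fft_freqs - lo) / (center - lo)
        falling = (hi - fft_freqs) / (hi - center)
        fb[m] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def nmf_activations(V, rank, n_iter=100, seed=0, eps=1e-10):
    """Factor V ~ W @ H with multiplicative updates (Frobenius loss)."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], rank)) + eps
    H = rng.random((rank, V.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

def fuse_features(signal, sr=16000, n_fft=512, hop=256, n_mels=40):
    """Stack log Mel-band energies and NMF coefficients as two channels."""
    S = stft_magnitude(signal, n_fft, hop)
    # Low-level envelope feature: log Mel-band energies, (n_mels, n_frames).
    log_mel = np.log(mel_filterbank(sr, n_fft, n_mels) @ (S ** 2) + 1e-10)
    # High-level structural feature: NMF coefficient matrix H, (n_mels, n_frames).
    # Assumption: rank == n_mels so both maps share one shape.
    _, H = nmf_activations(S, rank=n_mels)
    # Fused three-dimensional feature: (n_mels, n_frames, 2).
    return np.stack([log_mel, H], axis=-1)

# Usage on one second of synthetic noise (a stand-in for a real recording).
rng = np.random.default_rng(0)
signal = rng.standard_normal(16000)
features = fuse_features(signal)
print(features.shape)
```

The resulting array has the layout (bands, frames, channels), which is the usual input shape for a two-dimensional CNN. In practice each channel would likely be normalized separately before training, since log energies and NMF coefficients lie on different scales.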