Abstract:
To address the difficulty of extracting discriminative emotional features in speech emotion recognition, a speech representation method based on two-channel feature fusion is proposed, combining a convolutional neural network with a vision transformer. A convolutional channel built on an inverted-bottleneck structure is trained with a transformer-like training strategy to extract local spectral features. Global sequence features are extracted by an improved vision transformer: instead of splitting the spectrogram into patches, a convolutional neural network processes the whole speech spectrogram directly, which better captures temporal information. The features from the two channels are fused to obtain strongly discriminative emotion features, which are finally fed into a Softmax classifier to produce the recognition result. Experiments on the EMO-DB and CASIA databases show that the model proposed in this paper achieves average accuracies of 94.24% and 93.05%, respectively. These results surpass those of other models, indicating the effectiveness of the method.