Speech emotion recognition algorithm based on self-attention spatio-temporal features
-
Abstract
Existing speech emotion recognition (SER) methods fail to model key spatio-temporal dependencies, which leads to low recognition rates. To address this problem, a speech emotion recognition algorithm based on self-attention spatio-temporal features is proposed. A bilinear convolutional neural network (CNN), a long short-term memory (LSTM) network, and a multi-head attention mechanism are used to automatically learn the best spatio-temporal representation of the speech signal. First, the log-Mel features of the speech signal, together with their first-order and second-order differences, are extracted and combined into a 3D log-Mel feature set that serves as the input to the CNN. Then, to account for both spatial features and temporal dependencies, the outputs of the bilinear pooling and the bidirectional long short-term memory (BiLSTM) network are fused to obtain a spatio-temporal feature representation, and the multi-head attention mechanism is used to capture the discriminative features. Finally, a softmax function performs the classification. Experiments on the IEMOCAP and EMO-DB databases show recognition rates of 63.12% and 87.09%, respectively, which demonstrates the effectiveness of the method.
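The first step, building the 3D log-Mel input from the static features and their first- and second-order differences, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `delta` and `stack_3d_log_mel`, the use of central finite differences via `numpy.gradient` (regression-based deltas are a common alternative), the channel-first layout, and the 300-frame by 40-band input are all assumptions made for illustration.

```python
import numpy as np


def delta(feat: np.ndarray) -> np.ndarray:
    """Difference along the time axis (axis 0).

    Assumption: simple finite differences (central in the interior,
    one-sided at the edges); the paper may use a different delta formula.
    """
    return np.gradient(feat, axis=0)


def stack_3d_log_mel(log_mel: np.ndarray) -> np.ndarray:
    """Stack static, first-order and second-order features.

    log_mel: array of shape (frames, mel_bands).
    Returns an array of shape (3, frames, mel_bands); the channel-first
    layout is an assumption, chosen because CNN inputs are often laid out so.
    """
    d1 = delta(log_mel)   # first-order difference
    d2 = delta(d1)        # second-order difference
    return np.stack([log_mel, d1, d2], axis=0)


# Hypothetical input: 300 frames x 40 Mel bands of synthetic log-energies.
rng = np.random.default_rng(0)
log_mel = np.log(rng.random((300, 40)) + 1e-6)

features = stack_3d_log_mel(log_mel)
print(features.shape)
```

In a real pipeline the `log_mel` matrix would come from a Mel filterbank applied to the short-time spectrum of the utterance; synthetic values are used here only to keep the sketch self-contained.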