Speech emotion recognition algorithm based on self-attention spatio-temporal features

XU Huanan; ZHOU Xiaoyan; JIANG Wan; LI Dapeng

doi:10.16300/j.cnki.1000-3630.2021.06.011

XU Huanan, ZHOU Xiaoyan, JIANG Wan, LI Dapeng. Speech emotion recognition algorithm based on self-attention spatio-temporal features[J]. Technical Acoustics, 2021, 40(6): 807-814. DOI: 10.16300/j.cnki.1000-3630.2021.06.011

Abstract

To address the low recognition rate in speech emotion recognition (SER) caused by the failure to model key spatio-temporal dependencies, a speech emotion recognition algorithm based on self-attention spatio-temporal features is proposed. A bilinear convolutional neural network, a long short-term memory network, and a multi-head attention mechanism are used to automatically learn the best spatio-temporal representation of the speech signal. First, the log-Mel features of the speech signal and their first-order and second-order differences are extracted and combined into a 3D log-Mel feature set, which serves as the input to the CNN. Then, to account for the relationship between spatial features and temporal dependencies, the outputs of bilinear pooling and of a bidirectional long short-term memory network are fused to obtain a spatio-temporal feature representation, and the multi-head attention mechanism is used to capture discriminative features. Finally, a softmax function performs the classification. Experiments on the IEMOCAP and EMO-DB databases yield recognition rates of 63.12% and 87.09%, respectively, demonstrating the effectiveness of the method.
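The 3D log-Mel input described in the abstract can be illustrated with a minimal sketch. It assumes a log-Mel spectrogram has already been computed and uses NumPy's finite differences as a stand-in for the first-order and second-order differences; the function name `stack_deltas`, the 40-band by 300-frame size, and the random placeholder data are all hypothetical and are not taken from the paper, which may compute its differences with a different window.

```python
import numpy as np


def stack_deltas(log_mel: np.ndarray) -> np.ndarray:
    """Stack a log-Mel spectrogram with its time differences.

    log_mel has shape (n_mels, n_frames). The result has shape
    (3, n_mels, n_frames): static, first-order, second-order.
    Assumption: np.gradient (a simple finite difference along the
    time axis) approximates the differences named in the abstract.
    """
    delta = np.gradient(log_mel, axis=1)   # first-order difference over time
    delta2 = np.gradient(delta, axis=1)    # second-order difference over time
    return np.stack([log_mel, delta, delta2], axis=0)


# Placeholder data standing in for a real log-Mel spectrogram (hypothetical sizes).
rng = np.random.default_rng(0)
log_mel = rng.standard_normal((40, 300))

features = stack_deltas(log_mel)
print(features.shape)  # (3, 40, 300)
```

The three channels play the same role as the color channels of an image, which is what would let a 2D CNN consume the stacked array directly.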
