XU Huanan, ZHOU Xiaoyan, JIANG Wan, LI Dapeng. Speech emotion recognition algorithm based on self-attention spatio-temporal features[J]. Technical Acoustics, 2021, 40(6): 807-814. DOI: 10.16300/j.cnki.1000-3630.2021.06.011

Speech emotion recognition algorithm based on self-attention spatio-temporal features

  • Speech emotion recognition (SER) suffers from low recognition rates when key spatio-temporal dependencies cannot be modeled. To address this, a speech emotion recognition algorithm based on self-attention spatio-temporal features is proposed. A bilinear convolutional neural network (CNN), a long short-term memory network, and a multi-head attention mechanism are used to automatically learn the best spatio-temporal representation of the speech signal. First, the log-Mel features of the speech signal and their first-order and second-order differences are extracted and combined into a 3D log-Mel feature set that serves as the input to the CNN. Then, to account for the relation between spatial features and temporal dependencies, the outputs of bilinear pooling and of the bidirectional long short-term memory network are fused to obtain a spatio-temporal feature representation, and the multi-head attention mechanism is used to capture the discriminative features. Finally, a softmax function performs the classification. Experiments on the IEMOCAP and EMO-DB databases yield recognition rates of 63.12% and 87.09%, respectively, which demonstrates the effectiveness of the method.
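The 3D log-Mel input described in the abstract can be illustrated with a minimal sketch. It assumes the log-Mel matrix has already been computed (frames by mel bands) and approximates the first-order and second-order differences with a simple central difference. The abstract does not specify the difference formula, window length, or edge handling, so those choices, along with the function names `delta` and `stack_3d` and the toy values, are assumptions made for illustration rather than the authors' implementation.

```python
def delta(frames):
    """Central difference along the time axis (assumed formula).

    Edge frames are handled by repeating the first and last frame.
    """
    n = len(frames)
    out = []
    for t in range(n):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, n - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out


def stack_3d(log_mel):
    """Return three channels: static, first-order and second-order differences.

    Each channel has the same shape as the input (frames x mel bands).
    """
    d1 = delta(log_mel)   # first-order difference
    d2 = delta(d1)        # second-order difference (difference of the difference)
    return [log_mel, d1, d2]


# Toy input: 4 frames x 2 mel bands of made-up log-Mel values.
toy = [[0.0, 1.0], [1.0, 3.0], [4.0, 5.0], [9.0, 7.0]]
features = stack_3d(toy)
```

Here `features` holds three channels of identical shape, which is the layout a 2D CNN would consume as a three-channel image. In practice the log-Mel matrix would come from an audio library, and a regression-based delta over a wider window is a common alternative to the central difference used here.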
