Abstract:
Speaker verification for financial payments has not been widely adopted in China, owing to a lack of datasets and concerns about model security. To address these problems, this paper proposes a text-dependent speaker verification framework based on transfer learning and fundamental frequency feature fusion, evaluated on the self-recorded SHALCAS-WXSD22B dataset. In experiments on the digit-string corpora SHALCAS-WXSD22B-d006 and SHALCAS-WXSD22B-d007, the proposed framework achieves best equal error rates (EERs) of 0.88% and 1.05%, respectively. Compared with the ECAPA-TDNN baseline model, this corresponds to relative EER reductions of 17% and 20%, and the results meet the security requirements of financial payment applications. The experimental results show that the proposed method not only achieves better recognition accuracy and higher security than the baseline methods, but can also be applied to other models that use log-Mel features, such as ResNet34.
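The equal error rate quoted above is the operating point at which the false acceptance rate (impostors accepted) equals the false rejection rate (genuine speakers rejected). The sketch below shows one common way to estimate it from trial scores by sweeping the decision threshold. It is a minimal illustration, not the evaluation code used in this work: the function name `equal_error_rate` is hypothetical, and it assumes higher scores indicate a same-speaker trial and that labels are 1 for target and 0 for impostor trials.

```python
import numpy as np


def equal_error_rate(scores, labels):
    """Approximate EER from verification trial scores (hypothetical helper).

    scores: similarity scores, higher = more likely the same speaker (assumed).
    labels: 1 for target (genuine) trials, 0 for impostor trials.
    Score ties are not merged, so the result is an approximation.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    n_target = labels.sum()
    n_impostor = len(labels) - n_target

    # Sort trials from highest to lowest score; accepting the top k trials
    # corresponds to placing the threshold just below the k-th score.
    order = np.argsort(-scores, kind="stable")
    sorted_labels = labels[order]

    # Counts of accepted genuine / impostor trials for k = 0..N.
    accepted_genuine = np.concatenate(([0], np.cumsum(sorted_labels)))
    accepted_impostor = np.concatenate(([0], np.cumsum(1 - sorted_labels)))

    frr = 1.0 - accepted_genuine / n_target  # false rejection rate
    far = accepted_impostor / n_impostor     # false acceptance rate

    # EER: the threshold where FAR and FRR are closest; average the two.
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0


if __name__ == "__main__":
    demo_scores = [0.9, 0.6, 0.7, 0.2]
    demo_labels = [1, 1, 0, 0]
    print(f"EER = {100 * equal_error_rate(demo_scores, demo_labels):.2f}%")
```

Production toolkits usually interpolate between thresholds (for example on an ROC curve) rather than picking the nearest sweep point, which matters little when the number of trials is large.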