Abstract:
In speaker recognition applications such as access control and time-and-attendance systems, a Chinese short digit string corpus can improve the user experience, but at the cost of a marked degradation in recognition performance. To this end, this paper proposes a short-speech speaker recognition method that consists of a model pre-training phase and a transfer learning phase. First, an improved pre-training model is proposed, which effectively improves the generalization ability of the text-independent speaker recognition model through feature enhancement and a preheating network. Meanwhile, this paper proposes a multi-subcenter cross-entropy speaker classification loss, which effectively improves adaptation from the source domain to the target domain in the transfer learning phase. In addition, a relative entropy loss between long- and short-speech embeddings is proposed, which improves performance by mapping the distribution of short-speech embeddings onto that of long-speech embeddings, which is richer in timbre information. Experimental results on the Chinese short-speech dataset SHAL show that the proposed pre-trained model has strong generalization ability, and that the joint loss consisting of the multi-subcenter cross-entropy loss and the long-short speech embedding relative entropy loss can also effectively improve model performance.
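To make the joint objective concrete, the following is a minimal PyTorch sketch of the two terms described above: a multi-subcenter cross-entropy over cosine logits, and a relative-entropy (KL) term that pulls short-utterance embeddings toward the distribution of the corresponding long-utterance embeddings. The class and parameter names (SubCenterCE, num_subcenters, temperature, alpha) are illustrative assumptions, not the paper's actual implementation.

```python
# Hedged sketch of the joint loss: sub-center cross-entropy + long-short KL.
# Names and hyperparameters here are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SubCenterCE(nn.Module):
    """Cross-entropy over cosine logits with K sub-centers per speaker."""

    def __init__(self, embed_dim, num_speakers, num_subcenters=3, scale=30.0):
        super().__init__()
        self.weight = nn.Parameter(
            torch.randn(num_speakers, num_subcenters, embed_dim))
        self.scale = scale

    def forward(self, emb, labels):
        # Cosine similarity of each embedding to every sub-center: (B, S, K)
        emb = F.normalize(emb, dim=-1)
        w = F.normalize(self.weight, dim=-1)
        cos = torch.einsum('bd,skd->bsk', emb, w)
        # Each speaker's logit is its best-matching sub-center.
        logits = self.scale * cos.max(dim=-1).values        # (B, S)
        return F.cross_entropy(logits, labels)


def long_short_kl(short_emb, long_emb, temperature=2.0):
    """KL divergence mapping short-speech embeddings toward the (detached)
    long-speech embedding distribution."""
    p_long = F.softmax(long_emb.detach() / temperature, dim=-1)
    log_p_short = F.log_softmax(short_emb / temperature, dim=-1)
    return F.kl_div(log_p_short, p_long, reduction='batchmean')


if __name__ == '__main__':
    B, D, N = 8, 192, 100                     # batch, embedding dim, speakers
    ce = SubCenterCE(D, N)
    short_emb = torch.randn(B, D)             # embeddings of short digit strings
    long_emb = torch.randn(B, D)              # embeddings of paired long speech
    labels = torch.randint(0, N, (B,))
    alpha = 0.5                               # assumed weight between the terms
    loss = ce(short_emb, labels) + alpha * long_short_kl(short_emb, long_emb)
    print(loss.item())
```

In this sketch the long-speech embedding is detached, so the KL term only moves the short-speech branch toward the richer long-speech distribution rather than degrading the latter.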