A short-speech speaker recognition method based on transfer learning and multi-scale loss

  • Abstract: In speaker recognition applications such as access control and attendance systems, Chinese short digit-string corpora improve the user experience, but at the cost of a marked drop in recognition performance. To address this, this paper proposes a short-speech speaker recognition framework consisting of a model pre-training phase and a transfer learning phase. First, an improved pre-training model is proposed that effectively increases the generalization ability of the text-independent speaker recognition model through feature enhancement and a warm-up network. Second, a multi-sub-center cross-entropy speaker classification loss is proposed, which effectively improves adaptation from the source domain to the target domain during transfer learning. In addition, a long-short speech embedding relative-entropy loss is proposed, which improves performance by mapping the distribution of short-speech embeddings onto the distribution of long speech, which carries richer timbre information. Experimental results on the Chinese short-speech dataset SHAL show that the proposed pre-trained model generalizes well, and that the joint loss composed of the multi-sub-center cross-entropy loss and the long-short speech embedding relative-entropy loss also effectively improves model performance.
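    The abstract only names the two loss terms; as a rough, non-authoritative illustration, the sketch below shows one way a sub-center cross-entropy classification loss and a long-short embedding relative-entropy (KL) loss could be combined in PyTorch. The class name SubCenterSoftmaxLoss, the function long_short_kl_loss, the sub-center count, the temperature, the softmax reading of "embedding distribution", and the 0.5 loss weight are all illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch (not the authors' code): a sub-center softmax classification
# loss plus a KL-divergence term that pulls short-utterance embeddings toward
# the distribution of paired long-utterance embeddings. Names and
# hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SubCenterSoftmaxLoss(nn.Module):
    """Cross-entropy over cosine logits with K sub-centers per speaker.

    Each speaker owns K weight vectors; the logit for a speaker is the
    maximum cosine similarity over its sub-centers, so target-domain
    variation can be absorbed by different sub-centers.
    """

    def __init__(self, embed_dim: int, num_speakers: int, k: int = 3, scale: float = 30.0):
        super().__init__()
        self.k = k
        self.scale = scale
        self.weight = nn.Parameter(torch.empty(num_speakers * k, embed_dim))
        nn.init.xavier_uniform_(self.weight)

    def forward(self, emb: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Cosine similarity between each embedding and every sub-center.
        cos = F.linear(F.normalize(emb), F.normalize(self.weight))    # (B, S*K)
        cos = cos.view(emb.size(0), -1, self.k).max(dim=-1).values    # (B, S)
        return F.cross_entropy(self.scale * cos, labels)


def long_short_kl_loss(short_emb: torch.Tensor, long_emb: torch.Tensor,
                       temperature: float = 2.0) -> torch.Tensor:
    """KL divergence mapping the short-utterance embedding distribution
    onto the long-utterance distribution of the same speaker.

    Both inputs are (B, D) embeddings from paired short/long segments;
    the long-utterance branch is treated as a fixed target.
    """
    p_long = F.softmax(long_emb.detach() / temperature, dim=-1)       # target
    log_q_short = F.log_softmax(short_emb / temperature, dim=-1)      # student
    return F.kl_div(log_q_short, p_long, reduction="batchmean") * temperature ** 2


if __name__ == "__main__":
    B, D, S = 8, 192, 1000                       # batch, embedding dim, speakers
    short_e = torch.randn(B, D)                  # embeddings of short utterances
    long_e = torch.randn(B, D)                   # embeddings of paired long utterances
    labels = torch.randint(0, S, (B,))

    cls_loss = SubCenterSoftmaxLoss(D, S)(short_e, labels)
    kl_loss = long_short_kl_loss(short_e, long_e)
    total = cls_loss + 0.5 * kl_loss             # weighting is an assumption
    print(float(total))
```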
