用于提升聋哑人语音表现力的语音合成技术

马皓天; 洪峰; 毛海全; 郑立通; 牟宏宇; 许伟杰

doi:10.16300/j.cnki.1000-3630.23031601

摘要: 目前，聋哑人主要通过手语的方式与健听人进行沟通，但这对未接受专业手语学习的健听人来说是一种挑战。因此，将手语转换为文本，再将文本转换成带有聋哑人音色的、健听人能理解的语音非常具有研究意义。为研究聋哑人语音合成的可行性，文章首先分析了聋哑人的语音特征，并根据分析的结论，提出了能合成高自然度、高清晰度且带有聋哑人自身声音特色的模型算法以及相应的评估体系。文章根据不同残疾程度的聋哑人语音特征，提出了面向轻度残疾聋哑人的语音转换和合成方法以及面向重度残疾聋哑人的语音克隆方法。根据分析结果，轻度残疾聋哑人语音与健听人语音具有一定的共性，因此使用AdaIN-VC语音转换模型转换出带有聋哑人音色、高可懂度的语音，并将转换好的语音结合Tacotron2语音合成模型进行文本到语音的映射。考虑到重度残疾聋哑人语音的不稳定性，文章基于Zero-shot的SV2TTS语音克隆框架，使用了ECAPA-TDNN作为重度残疾聋哑人音色表征的说话人编码器，以获取准确的聋哑人表征。此外，文章还引入基于基频情感分类的风格迁移模块，对合成语音进行风格上的迁移。实验结果表明，在保证一定相似度的情况下，实验中两位轻残聋哑人的自然度主观意见评分别从原来的2.53和3.06提高至2.88和3.21，并且语音识别的错词率从100%分别降低至80.77%和76.91%。同样，文中提出的主观错词率也有明显的下降。而在语音克隆的实验中，模型合成的重残聋哑人语音与其自身音色的相似度主观相似度意见评分达到3，且聋哑人语音的自然度主观意见评分和情感表达能力均得到了提高。

Abstract: At present, deaf people communicate with hearing people mainly through sign language, which is a challenge for hearing people who have not received formal sign language training. It is therefore of considerable research value to convert sign language into text and then convert that text into speech that carries the deaf speaker's own timbre and can be understood by hearing people. To investigate the feasibility of text-to-speech (TTS) for deaf people, this paper first analyzes the characteristics of deaf speech and, based on that analysis, proposes models that synthesize highly natural and intelligible speech retaining the deaf speaker's own voice characteristics, together with a corresponding evaluation scheme. According to the speech characteristics associated with different degrees of impairment, a voice conversion and synthesis method is proposed for mildly impaired deaf speakers and a voice cloning method for severely impaired deaf speakers. The analysis shows that the speech of mildly impaired deaf speakers shares some characteristics with that of hearing speakers, so the AdaIN-VC voice conversion model is used to produce highly intelligible speech with the deaf speaker's timbre, and the converted speech is then used with the Tacotron2 speech synthesis model to map text to speech. Considering the instability of severely impaired deaf speech, the paper builds on the zero-shot SV2TTS voice cloning framework and uses ECAPA-TDNN as the speaker encoder to obtain an accurate representation of the deaf speaker's timbre. In addition, a style transfer module based on fundamental-frequency emotion classification is introduced to transfer style onto the synthesized speech. Experimental results show that, while a certain degree of similarity is maintained, the naturalness subjective opinion scores of the two mildly impaired deaf speakers in the experiment increase from 2.53 and 3.06 to 2.88 and 3.21, respectively, and the word error rate of speech recognition falls from 100% to 80.77% and 76.91%, respectively. The subjective word error rate proposed in the paper also decreases markedly. In the voice cloning experiment, the subjective similarity opinion score between the synthesized speech of the severely impaired deaf speaker and the speaker's own timbre reaches 3, and both the naturalness subjective opinion score and the emotional expressiveness of the deaf speech improve.
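
The speech recognition error figures reported above (100% falling to 80.77% and 76.91%) are word error rates. As a minimal sketch of how such a figure is conventionally computed, the function below divides the token-level edit distance between a reference transcript and a recognizer output by the reference length. The function name, the pre-tokenized inputs, and the choice of token unit (for Mandarin, typically characters) are assumptions made for illustration; the abstract does not specify the recognizer or the tokenization used in the paper.

```python
def word_error_rate(reference: list[str], hypothesis: list[str]) -> float:
    """Token-level edit distance divided by the reference length.

    Hypothetical helper for illustration; not the paper's evaluation code.
    """
    if not reference:
        raise ValueError("reference must contain at least one token")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and hypothesis[:j].
    prev = list(range(len(hypothesis) + 1))
    for i, ref_tok in enumerate(reference, start=1):
        curr = [i]  # deleting all i reference tokens matches an empty hypothesis
        for j, hyp_tok in enumerate(hypothesis, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference token missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra token in hypothesis
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(reference)
```

Because insertions count as errors, the rate can exceed 100% when the recognizer outputs more tokens than the reference contains. A rate of exactly 100%, as reported for the unprocessed speech, means the number of edits equals the number of reference tokens.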

Study on text to speech improving the voice expression of deaf people