Abstract:
Traditional singing voice conversion (SVC) models suffer from insufficient disentanglement between content information and speaker identity features. As a result, the singing style of the source singer cannot be fully removed and the features of the target singer are learned inadequately, which leads to unnatural conversion results. Data augmentation is critical for improving model generalization and robustness; however, existing augmentation methods are relatively simplistic and struggle to preserve the salient acoustic information of singing voices. To address these issues, this paper proposes ADA-SVC, a singing voice conversion model with adaptive data augmentation. Our method introduces an adaptive data augmentation module during training. Guided by acoustic principles, this module dynamically generates high-quality samples that share the same linguistic and prosodic content as the original but have subtly modified timbre. Training on these samples enables the model to better distinguish content from speaker identity features and thereby achieve more effective disentanglement. In addition, a speaker encoder extracts singer-specific information, a pitch extractor models fundamental frequency (F0) contours, and the prior and posterior encoders and the normalizing flow module from VITS are integrated to realize end-to-end singing voice conversion. Experimental results show that ADA-SVC improves mel-cepstral distortion (MCD) by 8.7% over the So-VITS baseline, and that its subjective similarity mean opinion score (MOS) is significantly higher than those of both the baseline and the ablation models, indicating a clear improvement in conversion quality.