SawSing: A DDSP-Based Singing Vocoder via Subtractive Sawtooth Waveform Synthesis

ABSTRACT

Neural singing voice synthesis converts spectral features predicted from musical scores into waveforms using a neural network-based vocoder. However, neural vocoders that perform well on speech data may suffer from phase discontinuities, which are more audible during the longer utterances contained in singing. Taking advantage of the strong inductive biases proposed in Differentiable Digital Signal Processing (DDSP), we propose SawSing, a DDSP-based singing vocoder that reconstructs a singing voice waveform from a mel-spectrogram. SawSing uses a neural network to tune the filter coefficients of subtractive sawtooth and Gaussian noise synthesizers. Because this architecture enforces phase continuity, SawSing can generate favorable singing voices without the characteristic phase-discontinuity glitch of other neural vocoders. It is also lightweight, interpretable, and training-efficient.
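
To make the signal path concrete, the following is a minimal NumPy sketch of subtractive synthesis with a sawtooth source and a Gaussian noise source. It is an illustration under stated assumptions, not the authors' implementation: the function names, the sample rate, the frame length, and the use of non-overlapping frames with circular frequency-domain filtering are all hypothetical simplifications, and the per-frame filter magnitudes are passed in by hand where SawSing would have a neural network predict them from the mel-spectrogram. The naive sawtooth here is also not band-limited, so it aliases at high pitches.

```python
import numpy as np

def sawtooth_from_f0(f0, sr):
    """Sawtooth in [-1, 1) driven by a per-sample f0 contour (Hz).

    The phase is the running sum of f0 / sr, so it never jumps between
    frames; this is the phase-continuity property the abstract refers to.
    """
    phase = np.cumsum(f0 / sr)            # accumulated phase in cycles
    return 2.0 * (phase % 1.0) - 1.0

def filter_frames(signal, mags, frame_len):
    """Multiply each frame's spectrum by one magnitude response.

    Assumption: non-overlapping frames and circular (zero-phase)
    filtering, chosen for brevity. A practical vocoder would use
    windowed overlap-add instead.
    """
    out = np.zeros_like(signal)
    for i, mag in enumerate(mags):
        start = i * frame_len
        frame = signal[start:start + frame_len]
        spec = np.fft.rfft(frame, n=frame_len)
        filtered = np.fft.irfft(spec * mag, n=frame_len)
        out[start:start + len(frame)] = filtered[:len(frame)]
    return out

def synthesize(f0, harmonic_mags, noise_mags, sr=24000, frame_len=240, seed=0):
    """Sum of a filtered sawtooth and filtered Gaussian noise.

    harmonic_mags and noise_mags have shape (n_frames, frame_len // 2 + 1)
    and stand in for the filter coefficients a network would output.
    """
    rng = np.random.default_rng(seed)
    saw = sawtooth_from_f0(f0, sr)
    noise = rng.standard_normal(len(f0))
    harmonic = filter_frames(saw, harmonic_mags, frame_len)
    aperiodic = filter_frames(noise, noise_mags, frame_len)
    return harmonic + aperiodic

# Usage: 0.1 s of a 220 Hz tone with an all-pass harmonic filter and
# the noise branch muted, so the output is just the raw sawtooth.
sr, frame_len, n_frames = 24000, 240, 10
f0 = np.full(n_frames * frame_len, 220.0)
n_bins = frame_len // 2 + 1
audio = synthesize(f0, np.ones((n_frames, n_bins)), np.zeros((n_frames, n_bins)),
                   sr=sr, frame_len=frame_len)
```

Because the source phase is accumulated sample by sample, changing the filter magnitudes from frame to frame reshapes the timbre without resetting the oscillator, which is why this kind of architecture avoids the phase-discontinuity glitch described above.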
