SelfVC: Voice Conversion With Iterative Refinement using Self Transformations

Interactive Demo

We present audio examples for our paper SelfVC: Voice Conversion With Iterative Refinement using Self Transformations. To perform zero-shot voice conversion, our synthesis model combines the content embedding of any given source utterance with the speaker embedding of the target speaker, which a speaker verification model derives from 10 seconds of the target speaker's audio. Our synthesizer can perform voice conversion in two modes:

Guided: In this setting, the prosody (speaking rate and pitch modulation) of the synthesized speech closely matches the prosody of the source utterance. To achieve this, the ground-truth duration and normalized F0 contour of the source utterance are used as intermediate inputs to the mel-spectrogram synthesizer.
Predictive: In this setting, the prosody of the synthesized speech is adapted based on the target speaker's audio. That is, we predict the normalized F0 contour and durations based on both the content and speaker embeddings, using the duration and pitch predictors in the synthesis model.
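
The two modes differ only in where the synthesizer's prosody inputs come from. The following is a minimal Python sketch of that selection, not the SelfVC implementation: `ProsodyInputs`, `select_prosody`, and the predictor callables are hypothetical names introduced for illustration, and the predictors are assumed to take the content and speaker embeddings as arguments.

```python
from dataclasses import dataclass
from typing import Callable

import numpy as np


@dataclass
class ProsodyInputs:
    durations: np.ndarray  # number of frames per content token
    pitch: np.ndarray      # normalized F0 contour


def select_prosody(
    content: np.ndarray,
    speaker: np.ndarray,
    source_durations: np.ndarray,
    source_pitch: np.ndarray,
    duration_predictor: Callable[[np.ndarray, np.ndarray], np.ndarray],
    pitch_predictor: Callable[[np.ndarray, np.ndarray], np.ndarray],
    duration_mode: str = "guided",
    pitch_mode: str = "guided",
) -> ProsodyInputs:
    """Choose the prosody inputs for the synthesizer (hypothetical helper).

    "guided" reuses values extracted from the source utterance;
    "predictive" calls a predictor on the content and speaker embeddings.
    """
    for mode in (duration_mode, pitch_mode):
        if mode not in ("guided", "predictive"):
            raise ValueError(f"unknown mode: {mode!r}")

    if duration_mode == "guided":
        durations = np.asarray(source_durations)
    else:
        durations = np.asarray(duration_predictor(content, speaker))

    if pitch_mode == "guided":
        pitch = np.asarray(source_pitch)
    else:
        pitch = np.asarray(pitch_predictor(content, speaker))

    return ProsodyInputs(durations=durations, pitch=pitch)
```

Keeping separate choices for duration and pitch is a design assumption of this sketch; it allows mixed settings, such as guided durations with predicted pitch.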

Zero-shot Any-to-Any Voice Conversion

For zero-shot any-to-any voice conversion, we randomly select 10 target speakers (5 male and 5 female) from the test-clean subset of the LibriTTS dataset. Next, we randomly select 20 source utterances from the remaining speakers and perform voice conversion for each of the 10 target speakers, giving 200 converted utterances in total. We present a few audio examples for this experiment in the table below.
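
The selection procedure above can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the authors' evaluation script: `build_eval_pairs` is a hypothetical helper, and the two metadata dictionaries (speaker to gender, speaker to utterance IDs) are assumed to have been prepared from the dataset beforehand.

```python
import random


def build_eval_pairs(speaker_gender, utterances_by_speaker,
                     n_per_gender=5, n_sources=20, seed=0):
    """Return (source_speaker, source_utterance, target_speaker) triples.

    speaker_gender: dict mapping speaker ID -> "M" or "F" (assumed format).
    utterances_by_speaker: dict mapping speaker ID -> list of utterance IDs.
    """
    rng = random.Random(seed)

    # Pick the target speakers: n_per_gender male and n_per_gender female.
    males = sorted(s for s, g in speaker_gender.items() if g == "M")
    females = sorted(s for s, g in speaker_gender.items() if g == "F")
    targets = rng.sample(males, n_per_gender) + rng.sample(females, n_per_gender)
    target_set = set(targets)

    # Source utterances come only from speakers that are not targets.
    pool = [
        (spk, utt)
        for spk in sorted(utterances_by_speaker)
        if spk not in target_set
        for utt in utterances_by_speaker[spk]
    ]
    sources = rng.sample(pool, n_sources)

    # Convert every source utterance to every target speaker.
    return [(spk, utt, tgt) for spk, utt in sources for tgt in targets]
```

With the defaults, the function returns 20 × 10 = 200 triples, and no source utterance is spoken by one of the target speakers.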

Conversion Type	Source Utterance	Target Speaker	Ours - SelfVC (Predictive)	Ours - SelfVC (Guided)

Comparison Against Past Work

We present audio examples for the same pairs of source and target audio using different voice conversion techniques, including our own. The source utterances and target speakers are selected from the test-clean subset of LibriTTS in the same way as described above for zero-shot any-to-any voice conversion. In this setting, we use the predictive mode for pitch and the guided mode for duration to ensure a fair comparison, since previous techniques preserve the duration of the source utterance. We produce audio examples for prior techniques using the voice conversion inference scripts provided in their respective official GitHub repositories.

Conversion Type	Source Utterance	Target Speaker	MediumVC	S3PRL-VC	YourTTS	ACE-VC	Ours - SelfVC

Cross-Lingual Voice Conversion

For cross-lingual voice conversion, we use the CSS10 dataset, which contains utterances in 10 different languages (Chinese, Greek, Finnish, Spanish, Dutch, German, Japanese, French, Hungarian, Russian). We consider three voice conversion tasks: English to CSS10, CSS10 to English, and CSS10 to CSS10. We present voice-converted examples from two of our models: 1) SelfVC (LibriTTS), which is trained only on English speech from the train-clean-360 subset of LibriTTS, and 2) SelfVC (LibriTTS + CSS10), which is finetuned on both English and CSS10 speech.

Conversion Type	Source Utterance (CSS10)	Target Speaker (English)	Ours - SelfVC (LibriTTS)	Ours - SelfVC (LibriTTS + CSS10)

Conversion Type	Source Utterance (English)	Target Speaker (CSS10)	Ours - SelfVC (LibriTTS)	Ours - SelfVC (LibriTTS + CSS10)

Conversion Type	Source Utterance (CSS10)	Target Speaker (CSS10)	Ours - SelfVC (LibriTTS)	Ours - SelfVC (LibriTTS + CSS10)

Manual Prosody Control

Besides the two inference modes above (guided and predictive), SelfVC also offers fine-grained control over the prosody of the synthesized speech. During inference, we can modify the pitch contour (normalized F0 contour) and the extracted durations of the source utterance to control the prosody of the synthesized speech. We present audio examples obtained by scaling the reference pitch contour and durations by a constant factor. This behaviour is similar to ACE-VC, except that we do not require any text transcriptions during training to extract duration targets for the duration predictor.
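
As a rough illustration of this kind of control, the Python sketch below scales a pitch contour and a set of per-token durations by constant factors. It is a sketch under assumptions, not the SelfVC code: `scale_prosody` is a hypothetical helper, a pace above 1 is assumed to mean faster speech (shorter durations), each token is assumed to keep at least one frame, and resampling the pitch contour to the new frame count is omitted.

```python
import numpy as np


def scale_prosody(pitch, durations, pitch_factor=1.0, pace=1.0):
    """Scale a normalized F0 contour and per-token durations (hypothetical helper).

    pitch_factor multiplies the pitch contour.
    pace divides the durations, so pace=1.5 speeds speech up (assumption).
    """
    if pitch_factor <= 0 or pace <= 0:
        raise ValueError("pitch_factor and pace must be positive")

    scaled_pitch = np.asarray(pitch, dtype=float) * pitch_factor

    # Durations are frame counts, so round to integers and keep
    # at least one frame per token (assumption of this sketch).
    scaled = np.rint(np.asarray(durations, dtype=float) / pace)
    scaled_durations = np.maximum(1, scaled).astype(int)

    return scaled_pitch, scaled_durations


# Example: a faster rendition with an unchanged pitch contour.
pitch, durations = scale_prosody([0.5, -1.0], [4, 2, 6], pace=1.5)
```

Passing `pitch_factor=1.0` and `pace=1.0` leaves both inputs unchanged, which corresponds to the guided mode.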

Pace Control

Conversion Type	Source Utterance	Target Speaker	Same Pace	Fast Pace (1.5 X)	Slow Pace (0.7 X)

Pitch Control

Conversion Type	Source Utterance	Target Speaker	Same Pitch (1X)	Higher Pitch (3X)	Lower Pitch (0.5X)

Bonus! SelfVC Finetuned on Celebrity Voices

Finally, we present audio examples from SelfVC finetuned on just a few minutes of speech data from different celebrities. Since SelfVC is trained in a text-free manner, we can adapt it to any speaker using only audio data. For this experiment, we download audio monologues (roughly 5-10 minutes per speaker) from YouTube and finetune SelfVC (LibriTTS) on the combined data. We accompany the generated audio with lip-synced video avatars generated using One Shot Talking Face.

We also have a live demo where you can upload your own audio and convert it to any of the celebrity voices below.

Disclaimer: This demo is for academic and research purposes only. We do not own the rights to the audio or video content used in this demo.

Source Utterance	Target Speaker	SelfVC Generated Audio