How should I pronounce your name?

Improving the performance of Voice Conversion technologies

Has your name ever been mispronounced? This project started off with the motivation to correctly pronounce the names of students from all around the world, in the voice of an announcer, at graduation ceremonies! Check out our paper and GitHub repository for more details!

In this project, we explored SOTA Voice Conversion (VC) models such as FragmentVC and AutoVC and saw that they have impressive few-shot conversion quality! However, when source or target accents, background noise conditions, or microphone characteristics differ from those seen in training, conversion quality is not guaranteed.

It's often challenging to document how to record quality utterances that will work well with a VC model. In fact, we observed that sampling rate, noise and volume levels, mic quality, and clarity of speech all had a significant impact on the output. We also saw that many real-world users were frustrated when trying to use VC models on their own data (as evidenced by the litany of GitHub issues). Among other things, these VC models are overfit to clean VCTK data (native English speakers).
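If you are recording your own utterances, a little preprocessing goes a long way. Below is a minimal sketch (our own illustration, not code from the repository) of how one might standardize a recording before feeding it to a VC model, assuming a 16 kHz model and using librosa for resampling, silence trimming, and peak normalization:

```python
import librosa
import numpy as np
import soundfile as sf

TARGET_SR = 16000  # assumed model sampling rate; check your checkpoint's config


def standardize_utterance(path: str, out_path: str, target_sr: int = TARGET_SR) -> None:
    """Resample, trim silence, and peak-normalize a recording."""
    # Load at the model's expected sampling rate (librosa resamples on load).
    wav, _ = librosa.load(path, sr=target_sr, mono=True)
    # Trim leading/trailing silence; a 25 dB threshold is a reasonable default.
    wav, _ = librosa.effects.trim(wav, top_db=25)
    # Peak-normalize so volume differences between mics matter less.
    peak = np.abs(wav).max()
    if peak > 0:
        wav = 0.95 * wav / peak
    sf.write(out_path, wav, target_sr)


standardize_utterance("my_recording.wav", "my_recording_16k.wav")
```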

Figure: issues faced by real-world users (a litany of GitHub issues).

In light of this, and motivated by a real-world application, we introduce a robust variant of FragmentVC, where we adapt the model via data augmentation during training (we augment the training data with accented English speech from Common Voice (CV)) and a noising module that adds randomly sampled effects to the data during feature extraction.
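The exact effects and their parameters are described in the paper; as a rough illustration, here is a minimal sketch of what such a noising module could look like, randomly applying gain changes and additive Gaussian noise to the waveform before feature extraction (the effect choices and ranges below are assumptions for illustration, not our exact training configuration):

```python
import random
import torch


class NoisingModule:
    """Illustrative on-the-fly waveform noising applied before feature extraction.

    Each call randomly samples a gain and an additive-noise level, so the
    model never sees the same clean waveform twice.
    """

    def __init__(self, gain_db=(-6.0, 6.0), snr_db=(10.0, 40.0), p=0.5):
        self.gain_db = gain_db
        self.snr_db = snr_db
        self.p = p  # probability of applying each effect

    def __call__(self, wav: torch.Tensor) -> torch.Tensor:
        if random.random() < self.p:
            # Random gain: simulate quiet/loud recordings.
            gain = 10 ** (random.uniform(*self.gain_db) / 20)
            wav = wav * gain
        if random.random() < self.p:
            # Additive Gaussian noise at a random SNR: simulate noisy mics/rooms.
            snr = random.uniform(*self.snr_db)
            signal_power = wav.pow(2).mean()
            noise_power = signal_power / (10 ** (snr / 10))
            wav = wav + noise_power.sqrt() * torch.randn_like(wav)
        return wav.clamp(-1.0, 1.0)


# During feature extraction: features = extract_features(noiser(wav))
noiser = NoisingModule()
```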

Figure: adapting FragmentVC into our robust variant.

In doing so, we show that it is in fact possible to adapt SOTA models easily, improve their performance, and help real-world users with a better checkpoint that they can use on their own data.
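As a hypothetical example of that workflow (the real conversion script, checkpoint name, and model signature are in the GitHub repository; everything below is illustrative), conversion on your own data could look roughly like this:

```python
import librosa
import torch

# Illustrative only: checkpoint path and model call are placeholders.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.jit.load("robust_fragmentvc.pt", map_location=device).eval()


def load_wav(path: str, sr: int = 16000) -> torch.Tensor:
    wav, _ = librosa.load(path, sr=sr, mono=True)
    return torch.from_numpy(wav).unsqueeze(0).to(device)  # shape (1, T)


src = load_wav("student_name.wav")  # content: how the name is pronounced
tgt = load_wav("announcer.wav")     # style: the announcer's voice
with torch.no_grad():
    mel = model(src, tgt)           # converted mel spectrogram (assumed output)
# A neural vocoder (not shown) would turn `mel` into a waveform.
```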

Figure: speakerwise mean quality Likert score histograms for the baselines and our robust models, CV-finetune and CV-VCTK, shown for all speakers, speakers grouped by gender, and speakers grouped by demographic group. Average scores and standard deviations summarize the performance of all four models across all annotators.