
1. Traditional Approach: Augment the quantity of data
1.1. Time Domain
1.1.1. Speech Rate (Tempo) Perturbation
1.1.1.1. WSOLA (Waveform Similarity Overlap and Add)
1.1.1.2. PSOLA (Pitch Synchronous Overlap and Add)
1.1.2. Speed Perturbation
1.1.2.1. Resample in Time Domain
1.1.2.1.1. Changes both the audio length and the spectral envelope
1.1.2.2. Perturbation & Interpolation on frame-level Spectrum
1.1.2.2.1. Theoretically the same effect as above, only implemented differently
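A minimal sketch of speed perturbation by time-domain resampling, to make the "length and spectrum both change" point concrete. The function name `speed_perturb`, the 16 kHz sample rate, and the 440 Hz test tone are assumptions for illustration; linear interpolation stands in for the band-limited resampler a real pipeline would normally use.

```python
import numpy as np

def speed_perturb(wave: np.ndarray, factor: float) -> np.ndarray:
    """Resample a 1-D waveform so that it plays `factor` times faster."""
    n_out = int(round(len(wave) / factor))
    # Fractional positions in the original signal read by each output sample.
    src_pos = np.arange(n_out) * factor
    # Linear interpolation is a simplification (no anti-aliasing filter).
    return np.interp(src_pos, np.arange(len(wave)), wave)

sr = 16000                               # assumed sample rate in Hz
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440.0 * t)     # 1 s test tone at 440 Hz
fast = speed_perturb(tone, 1.1)          # hypothetical perturbation factor

# Locate the dominant frequency of the perturbed signal.
peak_bin = int(np.argmax(np.abs(np.fft.rfft(fast))))
peak_hz = peak_bin * sr / len(fast)
print(len(tone), len(fast), round(peak_hz))
```

With a factor above 1 the output is shorter and its spectral peak moves up by the same factor, which is the difference from tempo perturbation (WSOLA / PSOLA), where the length changes but the spectrum is meant to stay in place.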
1.1.3. Add noise
1.1.4. Add reverberation
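The last two items can be sketched as follows, assuming additive noise mixed at a target signal-to-noise ratio and reverberation simulated by convolving with a room impulse response (RIR). The helper names `add_noise` and `add_reverb`, the synthetic signals, and the decaying-noise RIR are all illustrative assumptions; real setups would draw noise and RIRs from recorded or simulated corpora.

```python
import numpy as np

def add_noise(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix `noise` into `speech` at the requested SNR (in dB)."""
    noise = np.resize(noise, speech.shape)       # repeat or truncate to match length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)                # assumed to be non-zero
    # Scale the noise so that p_speech / p_scaled_noise == 10^(snr_db / 10).
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

def add_reverb(speech: np.ndarray, rir: np.ndarray) -> np.ndarray:
    """Convolve with a room impulse response and keep the original length."""
    return np.convolve(speech, rir)[: len(speech)]

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)              # stand-in for a real utterance
noise = rng.standard_normal(4000)                # stand-in for a noise recording
noisy = add_noise(speech, noise, snr_db=10.0)

# Toy RIR: exponentially decaying noise with a unit direct path.
rir = rng.standard_normal(800) * np.exp(-np.arange(800) / 100.0)
rir[0] = 1.0
reverberant = add_reverb(speech, rir)
```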
1.2. Frequency Domain
1.2.1. Vocal Tract Length Perturbation (VTLP)
1.2.1.1. Mimic the difference in vocal tract length (mainly linear frequency warping)
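As an illustration of the warping, here is a sketch of one commonly used piecewise-linear VTLP warp: frequencies below a boundary are scaled by a factor `alpha`, and the remainder is mapped linearly so that the Nyquist frequency stays fixed. The function name `vtlp_warp`, the 4800 Hz boundary parameter `f_hi`, the 16 kHz sample rate, and the example `alpha` are assumptions, not values taken from these notes.

```python
import numpy as np

def vtlp_warp(freqs, alpha: float, f_hi: float = 4800.0, sr: int = 16000) -> np.ndarray:
    """Piecewise-linear frequency warp that keeps 0 Hz and Nyquist fixed."""
    nyquist = sr / 2.0
    f_hi_scaled = f_hi * min(alpha, 1.0)
    boundary = f_hi_scaled / alpha               # where the two segments meet
    freqs = np.asarray(freqs, dtype=float)
    # Upper segment: straight line from (boundary, f_hi_scaled) to (nyquist, nyquist).
    upper = nyquist - (nyquist - f_hi_scaled) / (nyquist - boundary) * (nyquist - freqs)
    # Lower segment: plain linear scaling by alpha.
    return np.where(freqs <= boundary, alpha * freqs, upper)

centers = np.linspace(0.0, 8000.0, 9)            # hypothetical filter-bank centre frequencies
alpha = 1.1                                      # typically sampled per utterance, e.g. near 1.0
warped = vtlp_warp(centers, alpha)
print(np.round(warped, 1))
```

In practice the warp is usually applied to the centre frequencies of the mel filter bank, so each utterance is analysed with a slightly stretched or compressed frequency axis.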
1.2.2. Stochastic Feature Mapping
1.2.2.1. Estimate a maximum-likelihood linear transformation, in some feature space, of the source speaker against the speaker-dependent model of the target speaker (statistical method)
1.2.3. Perturbation on (log mel) spectrogram (SpecAugment)
1.2.3.1. Time warping: deformation of the time-series in the time direction
1.2.3.2. Time masking: mask a block of consecutive time steps
1.2.3.3. Frequency masking: mask a block of consecutive mel frequency channels
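The two masking operations can be sketched as below, assuming a spectrogram laid out as (mel channels, time frames) with one mask of each kind. The function name `spec_augment`, the maximum mask widths, and the choice of zero as the mask value are assumptions; time warping is left out of this sketch.

```python
import numpy as np

def spec_augment(spec: np.ndarray, max_f: int = 8, max_t: int = 20, rng=None) -> np.ndarray:
    """Apply one frequency mask and one time mask to a (n_mels, n_frames) array."""
    rng = np.random.default_rng() if rng is None else rng
    out = spec.copy()                            # leave the input untouched
    n_mels, n_frames = out.shape

    # Frequency masking: zero a block of f consecutive mel channels.
    f = int(rng.integers(0, min(max_f, n_mels) + 1))
    f0 = int(rng.integers(0, n_mels - f + 1))
    out[f0:f0 + f, :] = 0.0

    # Time masking: zero a block of t consecutive time steps.
    t = int(rng.integers(0, min(max_t, n_frames) + 1))
    t0 = int(rng.integers(0, n_frames - t + 1))
    out[:, t0:t0 + t] = 0.0
    return out

spec = np.ones((40, 100))                        # stand-in for a log mel spectrogram
masked = spec_augment(spec, rng=np.random.default_rng(0))
```

Zero is a natural mask value only if the features are mean-normalised; otherwise the per-utterance mean may be a better fill value.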
1.2.4. Generative Adversarial Network (GAN)
1.2.4.1. More variations than VTLP in terms of frequency warping
1.2.4.2. Spectrogram approach: requires aligning the audio lengths of paired (ctrl, dys) utterances
1.2.4.2.1. Zero-pad the shorter audio, although this may introduce background noise into the generated audio
1.2.4.3. (Frame level) Spectrum approach: generate the audio frame by frame based on spectrum
2. End-to-End Approach: Perturb the latent variables / vectors
2.1. Speed Perturbation (directly manipulate the time series of frequency vectors)
2.2. SpecAugment
2.3. Sub-sequence Sampling (with constraints, e.g. the sub-sequence must be longer than half of the original sequence)
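A minimal sketch of sub-sequence sampling under the length constraint mentioned above, assuming the input is a sequence of frame-level vectors indexed along its first axis. The function name `sample_subsequence` and the `min_ratio` parameter are hypothetical; how the transcript is adjusted for the cropped segment is outside this sketch.

```python
import numpy as np

def sample_subsequence(frames: np.ndarray, min_ratio: float = 0.5, rng=None) -> np.ndarray:
    """Return a random contiguous slice longer than `min_ratio` of the input."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(frames)
    # Smallest length strictly greater than min_ratio * n, capped at n.
    min_len = min(int(np.floor(n * min_ratio)) + 1, n)
    length = int(rng.integers(min_len, n + 1))   # upper bound is exclusive
    start = int(rng.integers(0, n - length + 1))
    return frames[start:start + length]

frames = np.arange(100)                          # stand-in for 100 frame vectors
sub = sample_subsequence(frames, rng=np.random.default_rng(0))
print(len(sub), sub[0], sub[-1])
```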
3. Subspace Learning: Augment along the feature dimension
3.1. SVD: Frequency domain (eigen)space basis vectors
3.1.1. Same dimension for all utterances. The dimension can be set to the number of filter banks (plus first-order derivatives)
3.2. SVD: Temporal domain (eigen)space basis vectors
3.2.1. Dynamic Time Warping to get the same length
3.2.2. Encoder - Decoder to get the same length
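The dimension argument in this section can be checked with a small sketch, assuming each utterance is a (frames, filter banks) feature matrix. The random matrices, the 40 filter banks, and the helper name `svd_bases` are illustrative assumptions only.

```python
import numpy as np

def svd_bases(feats: np.ndarray):
    """Split a (n_frames, n_fbank) matrix into temporal and frequency bases."""
    u, s, vt = np.linalg.svd(feats, full_matrices=False)
    # Columns of u span the temporal subspace, columns of vt.T the frequency subspace.
    return u, s, vt.T

rng = np.random.default_rng(0)
n_fbank = 40                                     # assumed number of filter banks
utt_short = rng.standard_normal((120, n_fbank))  # stand-in for a 120-frame utterance
utt_long = rng.standard_normal((300, n_fbank))   # stand-in for a 300-frame utterance

for feats in (utt_short, utt_long):
    u, s, v = svd_bases(feats)
    print("temporal basis:", u.shape, "frequency basis:", v.shape)
```

The frequency-domain basis has the same shape for both utterances, while the temporal basis grows with the number of frames, which is why the temporal case first needs the utterances brought to a common length (by Dynamic Time Warping or an encoder-decoder, as listed above).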