Data Augmentation


1. Traditional Approach: Augment on the data quantity

1.1. Time Domain

1.1.1. Speech Rate (Tempo) Perturbation

1.1.1.1. WSOLA (Waveform Similarity Overlap and Add)

1.1.1.2. PSOLA (Pitch Synchronous Overlap and Add)

1.1.2. Speed Perturbation

1.1.2.1. Resampling in the time domain: changes both the audio length and the spectral envelope

1.1.2.2. Perturbation and interpolation on the frame-level spectrum: theoretically the same effect as resampling, just a different implementation
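The time-domain resampling variant can be sketched as follows. This is a minimal illustration using linear interpolation, not a production resampler; the function name `speed_perturb` and the factors 1.1 and 0.9 are assumptions chosen for the example.

```python
import numpy as np

def speed_perturb(wave: np.ndarray, factor: float) -> np.ndarray:
    """Resample `wave` so it plays `factor` times faster (factor > 1 shortens it)."""
    n_out = int(round(len(wave) / factor))
    # Position in the original signal that each output sample is read from.
    src_pos = np.arange(n_out) * factor
    # Linear interpolation; positions past the end are clamped to the last sample.
    return np.interp(src_pos, np.arange(len(wave)), wave)

sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440.0 * t)   # 1 second of a 440 Hz tone
fast = speed_perturb(tone, 1.1)        # shorter, and the tone moves up in frequency
slow = speed_perturb(tone, 0.9)        # longer, and the tone moves down in frequency
```

Because the output is played back at the original sample rate, both the duration and the spectrum change together, which is the effect described for this node.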

1.1.3. Add noise
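One common way to add noise is to scale it to a chosen signal-to-noise ratio. The sketch below assumes white Gaussian noise and a hypothetical helper name `add_noise`; recorded background noise could be substituted for the random samples.

```python
import numpy as np

def add_noise(wave: np.ndarray, snr_db: float, rng: np.random.Generator) -> np.ndarray:
    """Mix white Gaussian noise into `wave` at the requested SNR (in dB)."""
    noise = rng.standard_normal(len(wave))
    signal_power = np.mean(wave ** 2)
    noise_power = np.mean(noise ** 2)
    # Choose the scale so that signal_power / (scale^2 * noise_power) = 10^(snr_db / 10).
    scale = np.sqrt(signal_power / (noise_power * 10 ** (snr_db / 10)))
    return wave + scale * noise

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 220.0 * np.arange(8000) / 8000)
noisy = add_noise(clean, snr_db=10.0, rng=rng)
```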

1.1.4. Add reverberation
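Reverberation is usually simulated by convolving the audio with a room impulse response (RIR). The sketch below assumes a synthetic RIR made of exponentially decaying noise; the helper names and the RT60 value are hypothetical, and a measured RIR could be used instead.

```python
import numpy as np

def synthetic_rir(sr: int, rt60: float, rng: np.random.Generator) -> np.ndarray:
    """Toy impulse response: noise whose amplitude falls by 60 dB after `rt60` seconds."""
    n = int(round(sr * rt60))
    t = np.arange(n) / sr
    decay = np.exp(-np.log(1000.0) * t / rt60)  # 10^-3 in amplitude is -60 dB
    rir = rng.standard_normal(n) * decay
    rir[0] = 1.0                                # direct path
    return rir

def add_reverb(wave: np.ndarray, rir: np.ndarray) -> np.ndarray:
    """Convolve with the impulse response and keep the original length."""
    return np.convolve(wave, rir)[: len(wave)]

rng = np.random.default_rng(0)
rir = synthetic_rir(sr=8000, rt60=0.3, rng=rng)
dry = np.sin(2 * np.pi * 220.0 * np.arange(8000) / 8000)
wet = add_reverb(dry, rir)
```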

1.2. Frequency Domain

1.2.1. Vocal Tract Length Perturbation (VTLP): mimics differences in vocal tract length between speakers (mainly linear frequency warping)
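A frequency warp of this kind can be sketched as below. The piecewise-linear form (linear scaling by a factor `alpha` up to a boundary, then a second linear segment that pins the Nyquist frequency) is one commonly used formulation and is an assumption here, as are the names `vtlp_warp`, `apply_vtlp` and the value of `f_hi`.

```python
import numpy as np

def vtlp_warp(freqs: np.ndarray, alpha: float, f_hi: float, nyquist: float) -> np.ndarray:
    """Map each frequency to its warped position; 0 and `nyquist` stay fixed."""
    boundary = f_hi * min(alpha, 1.0) / alpha
    low = freqs * alpha                                   # plain linear scaling
    slope = (nyquist - f_hi * min(alpha, 1.0)) / (nyquist - boundary)
    high = nyquist - slope * (nyquist - freqs)            # segment ending at Nyquist
    return np.where(freqs <= boundary, low, high)

def apply_vtlp(spectrum: np.ndarray, freqs: np.ndarray, alpha: float,
               f_hi: float, nyquist: float) -> np.ndarray:
    """Move the energy at frequency f to the warped frequency, via interpolation."""
    warped = vtlp_warp(freqs, alpha, f_hi, nyquist)
    return np.interp(freqs, warped, spectrum)

freqs = np.linspace(0.0, 8000.0, 257)
spectrum = np.zeros(257)
spectrum[32] = 1.0                                        # a peak at 1000 Hz
shifted = apply_vtlp(spectrum, freqs, alpha=1.1, f_hi=4800.0, nyquist=8000.0)
```

In practice `alpha` would be drawn at random per utterance from a narrow range around 1.0, and the warp would typically be applied to the filter bank centre frequencies rather than to a raw spectrum.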

1.2.2. Stochastic Feature Mapping: estimate a maximum likelihood linear transformation, in some feature space, of the source speaker against the speaker-dependent model of the target speaker (statistical method)

1.2.3. Perturbation on the (log mel) spectrogram (SpecAugment)

1.2.3.1. Time warping: deformation of the time series in the time direction

1.2.3.2. Time masking: mask a block of consecutive time steps

1.2.3.3. Frequency masking: mask a block of consecutive mel frequency channels
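The two masking operations can be sketched as follows (time warping is omitted for brevity). The function name, the mask-width limits, and the choice of zero as the fill value are assumptions for illustration.

```python
import numpy as np

def mask_spectrogram(spec: np.ndarray, max_f: int, max_t: int,
                     rng: np.random.Generator) -> np.ndarray:
    """Apply one frequency mask and one time mask to a (n_mels, n_frames) array."""
    out = spec.copy()
    n_mels, n_frames = out.shape

    # Frequency masking: zero a block of consecutive mel channels.
    f = rng.integers(0, min(max_f, n_mels) + 1)
    f0 = rng.integers(0, n_mels - f + 1)
    out[f0:f0 + f, :] = 0.0

    # Time masking: zero a block of consecutive time steps.
    t = rng.integers(0, min(max_t, n_frames) + 1)
    t0 = rng.integers(0, n_frames - t + 1)
    out[:, t0:t0 + t] = 0.0
    return out

rng = np.random.default_rng(0)
log_mel = rng.standard_normal((80, 300))   # stand-in for a real log mel spectrogram
augmented = mask_spectrogram(log_mel, max_f=15, max_t=40, rng=rng)
```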

1.2.4. Generative Adversarial Network (GAN): more variation than VTLP in terms of frequency warping

1.2.4.1. Spectrogram approach: requires aligning the audio length of paired (ctrl, dys) utterances. The shorter audio is zero-padded, but this may lead to background noise in the final generated audio

1.2.4.2. (Frame-level) spectrum approach: generate the audio frame by frame based on the spectrum

2. End to End Approach: Perturb the latent variables / vectors

2.1. Speed Perturbation (directly manipulate the time series of frequency vectors)
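Applied to a sequence of feature vectors rather than to the waveform, the same idea amounts to interpolating between neighbouring frames. A minimal sketch, assuming linear interpolation and a hypothetical function name `perturb_feature_speed`:

```python
import numpy as np

def perturb_feature_speed(feats: np.ndarray, factor: float) -> np.ndarray:
    """Stretch or compress a (n_frames, n_dims) sequence along the time axis."""
    n_frames = feats.shape[0]
    n_out = max(1, int(round(n_frames / factor)))
    # Fractional source positions, spanning the first to the last frame.
    src = np.linspace(0.0, n_frames - 1, n_out)
    lo = np.floor(src).astype(int)
    hi = np.minimum(lo + 1, n_frames - 1)
    w = (src - lo)[:, None]
    # Blend the two frames surrounding each fractional position.
    return (1.0 - w) * feats[lo] + w * feats[hi]

rng = np.random.default_rng(0)
feats = rng.standard_normal((100, 40))
faster = perturb_feature_speed(feats, 1.1)   # fewer frames, same feature dimension
```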

2.2. SpecAugment

2.3. Sub-sequence Sampling (with constraints, such as requiring the sub-sequence to be longer than half of the original sequence)
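A sketch of sub-sequence sampling under the length constraint mentioned above. The function name and the uniform choice of length and start position are assumptions for illustration.

```python
import numpy as np

def sample_subsequence(frames: np.ndarray, rng: np.random.Generator,
                       min_ratio: float = 0.5) -> np.ndarray:
    """Return a random contiguous slice longer than `min_ratio` of the input."""
    n = len(frames)
    # Smallest length that is strictly greater than min_ratio * n.
    min_len = min(n, int(n * min_ratio) + 1)
    length = rng.integers(min_len, n + 1)
    start = rng.integers(0, n - length + 1)
    return frames[start:start + length]

rng = np.random.default_rng(0)
frames = np.arange(20)
sub = sample_subsequence(frames, rng)   # a contiguous run of at least 11 frames
```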

3. Subspace Learning: Augment on the dimension of features

3.1. SVD: Frequency domain (eigen)space basis vectors

3.1.1. Same dimension for all utterances. The dimension can be set to the number of filter banks (plus first-order derivatives)
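The point about dimension can be illustrated with a small sketch: decomposing a (frames x features) matrix yields basis vectors whose size depends only on the feature dimension, not on the utterance length. The feature size of 80 (40 filter banks plus 40 first-order derivatives) and the random stand-in data are assumptions.

```python
import numpy as np

def spectral_basis(features: np.ndarray):
    """SVD of a (n_frames, n_dims) matrix; rows of `vt` span the feature space."""
    _, singular_values, vt = np.linalg.svd(features, full_matrices=False)
    return singular_values, vt

rng = np.random.default_rng(0)
n_dims = 80                                      # 40 filter banks + 40 deltas (assumed)
short_utt = rng.standard_normal((120, n_dims))   # 120 frames
long_utt = rng.standard_normal((300, n_dims))    # 300 frames

_, basis_short = spectral_basis(short_utt)
_, basis_long = spectral_basis(long_utt)
# Both bases have shape (80, 80) even though the utterance lengths differ.
```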

3.2. SVD: Tempo domain (eigen)space basis vectors

3.2.1. Dynamic Time Warping to get the same length
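A minimal sketch of using dynamic time warping to bring an utterance to the length of a reference. The Euclidean frame distance, the averaging of frames that map to the same reference index, and the function names are assumptions for illustration.

```python
import numpy as np

def dtw_path(x: np.ndarray, y: np.ndarray):
    """Classic DTW between (n, d) and (m, d) sequences; returns aligned index pairs."""
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(x[i - 1] - y[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # Backtrack from the end to the start along the cheapest predecessors.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        steps = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min(steps, key=lambda p: cost[p])
        path.append((i - 1, j - 1))
    return path[::-1]

def align_to_reference(x: np.ndarray, ref: np.ndarray) -> np.ndarray:
    """Warp `x` onto the time axis of `ref`, so the result has len(ref) frames."""
    aligned = np.zeros_like(ref, dtype=float)
    counts = np.zeros(len(ref))
    for i, j in dtw_path(x, ref):
        aligned[j] += x[i]
        counts[j] += 1
    return aligned / counts[:, None]   # average the frames mapped to each index

rng = np.random.default_rng(0)
utt = rng.standard_normal((30, 3))
ref = rng.standard_normal((20, 3))
same_length = align_to_reference(utt, ref)   # shape (20, 3)
```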

3.2.2. Encoder - Decoder to get the same length

4. Semi-Supervised Training: Collect more unlabelled audio

4.1. Use the labelled data to train a bootstrap system as a recognizer, then use it to decode the unlabelled data, producing a confidence estimate for each decoding output. Set a threshold to filter out outputs with low confidence. Then add the remaining data, with their automatically generated labels, to the training set and re-train the system.
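The confidence filtering step can be sketched as below. The `decode` callable stands in for the bootstrap recognizer and is hypothetical, as are the utterance IDs, hypotheses, confidence scores, and the threshold of 0.8.

```python
from typing import Callable, List, Tuple

def select_pseudo_labels(
    utterances: List[str],
    decode: Callable[[str], Tuple[str, float]],
    threshold: float,
) -> List[Tuple[str, str]]:
    """Keep (utterance, hypothesis) pairs whose confidence meets the threshold."""
    selected = []
    for utt in utterances:
        hypothesis, confidence = decode(utt)
        if confidence >= threshold:
            selected.append((utt, hypothesis))
    return selected

# Made-up decoder output, used only to exercise the filter.
fake_output = {
    "utt1": ("turn on the light", 0.95),
    "utt2": ("turn of the lie", 0.40),
    "utt3": ("open the door", 0.80),
}
extra_training_data = select_pseudo_labels(
    list(fake_output), lambda utt: fake_output[utt], threshold=0.8
)
```

The selected pairs would then be merged with the labelled set before re-training; the threshold trades the amount of added data against the label error rate.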

4.2. May not be suitable for disordered speech recognition, since we don't have such unlabelled audio

5. Augmentation Policy Learning

5.1. Reinforcement Learning: Learning policies (for image) and possibly policy sequences for speech