2022-01-30 13:57:59
data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language
- [Illustration](https://scontent-arn2-1.xx.fbcdn.net/v/t39.2365-6/271815807_4636921079718503_8613393990345138136_n.gif?_nc_cat=107&ccb=1-5&_nc_sid=ad8a9d&_nc_ohc=yn27DielBOYAX8rk045&_nc_ht=scontent-arn2-1.xx&oh=00_AT8ueSOOllDdunQw26KIBUYwyoOq_b1leSPKrmSfZoeazA&oe=61F26871)
- [Link](https://ai.facebook.com/blog/the-first-high-performance-self-supervised-algorithm-that-works-for-speech-vision-and-text/)
- These are actually 3 separate models (!) - marketing lies as usual
- Compute is not clearly stated: the NLP model is trained on 16 GPUs; the others are not specified
- The first high-performance self-supervised algorithm that works for speech, vision, and text
- Trained by predicting the model representations of the full input data given a partial view of the input
- Standard Transformer architecture with a modality-specific encoding
- Targets for the unmasked input come from a teacher network whose weights are an exponentially moving average of the student's parameters
- Training targets based on the output of the top K blocks of the teacher network for time-steps which are masked in student mode
- We apply a normalization to each block before averaging the top K blocks
- For speech representations, we use instance normalization
- For NLP and vision, parameter-less layer normalization works well
- 800 epochs; two model sizes with 86M and 307M parameters
- Smooth L1 loss between the student's predictions and the teacher targets (a sketch of the full training step follows this list)
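A minimal PyTorch sketch of the data2vec training step described above. The EMA teacher update, top-K block averaging with normalization, and the smooth L1 loss at masked positions follow the paper; everything else (the `student`/`teacher` callables, the assumption that `teacher(x)` returns per-block hidden states, `tau`, `top_k`) is an illustrative assumption, not the fairseq implementation:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, tau=0.999):
    # Teacher weights track an exponential moving average of the student's.
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(tau).add_(ps, alpha=1.0 - tau)

@torch.no_grad()
def build_targets(teacher, x, top_k=8):
    # The teacher sees the full (unmasked) input; `teacher(x)` is assumed to
    # return a list of per-block hidden states, each (batch, time, dim).
    blocks = teacher(x)[-top_k:]
    # Normalize each block before averaging the top K blocks (parameter-less
    # layer norm here; the paper uses instance norm for speech).
    blocks = [F.layer_norm(b, b.shape[-1:]) for b in blocks]
    return torch.stack(blocks).mean(dim=0)

def data2vec_step(student, teacher, x, mask):
    # `mask` is a boolean (batch, time) tensor of masked time-steps; the
    # student gets the partial (masked) view and regresses the teacher
    # targets only at the masked positions.
    targets = build_targets(teacher, x)
    preds = student(x, mask=mask)
    return F.smooth_l1_loss(preds[mask], targets[mask])
```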
HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units
- [arXiv:2106.07447](https://arxiv.org/abs/2106.07447)
- Offline clustering step to provide aligned target labels for a BERT-like prediction loss
- Applying the prediction loss over the masked regions only
- Relies on the consistency of the unsupervised clustering step rather than the intrinsic quality of the assigned cluster labels
- Acoustic unit discovery models to provide frame-level targets
- How to mask and where to apply the prediction loss:
- p% of the timesteps are randomly selected as start indices, and spans of l steps are masked
- cross-entropy losses computed separately over masked and unmasked timesteps, combined with a weight α: L = α·L_masked + (1 − α)·L_unmasked (see the masking/loss sketch after this list)
- α = 1 (masked-only prediction) is more resilient to low-quality cluster targets, as demonstrated in the paper's experiments
- Multiple clusterings (cluster ensembles), with iterative refinement starting from MFCC features
- Convolutional waveform encoder, a BERT encoder, a projection layer and a code embedding layer
- BASE, LARGE, and X-LARGE - 95M, 317M, 964M
- ![image](https://user-images.githubusercontent.com/12515440/150782226-92accb43-380a-4e0f-91f5-86fdba4624ce.png)
- Convolutional encoder generates a feature sequence at a 20ms framerate for audio sampled at 16kHz (CNN encoder down-sampling factor is 320x)
- After pre-training, CTC loss for ASR fine-tuning of all model weights except the convolutional audio encoder, which remains frozen (a CTC setup sketch follows the list)
- CTC target vocabulary includes 26 English chars + space + apostrophe + CTC blank
- 960h of LibriSpeech + 60kh of Libri-light
- First-iteration labels: k-means clustering with 100 clusters on 39-dimensional MFCC features (13 coefficients plus their first- and second-order derivatives), computed on the 960-hour LibriSpeech training set
- For the subsequent iterations, k-means clustering with 500 clusters on the latent features from the HuBERT model pre-trained in the previous iteration
- MiniBatchKMeans from scikit-learn (a label-generation sketch follows the list)
- BASE - two iterations on the 960h set on 32 GPUs (batch size of at most 87.5 seconds of audio per GPU), 250k steps
- LARGE and X-LARGE for one iteration on 60kh on 128 and 256 GPUs, respectively, for 400k steps
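A hedged sketch of HuBERT's span masking and α-weighted loss from the list above; `p=0.08` and `l=10` match the paper's masking defaults, and each time-step corresponds to one 20 ms frame (16 kHz / 320× CNN down-sampling). Function names and tensor layouts are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def span_mask(batch, time, p=0.08, l=10, device="cpu"):
    # Randomly pick ~p of the time-steps as span starts, then mask l
    # consecutive steps from each start (spans may overlap).
    mask = torch.zeros(batch, time, dtype=torch.bool, device=device)
    starts = torch.rand(batch, time, device=device) < p
    for b, t in starts.nonzero(as_tuple=False).tolist():
        mask[b, t:t + l] = True
    return mask

def hubert_loss(logits, targets, mask, alpha=1.0):
    # logits: (batch, time, num_clusters); targets: (batch, time) cluster ids.
    # L = alpha * L_masked + (1 - alpha) * L_unmasked; alpha = 1 recovers the
    # masked-only prediction the paper finds most robust to noisy labels.
    loss_m = F.cross_entropy(logits[mask], targets[mask])
    loss_u = F.cross_entropy(logits[~mask], targets[~mask])
    return alpha * loss_m + (1 - alpha) * loss_u
```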
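And a sketch of the first-iteration label generation: 39-dimensional MFCCs (13 coefficients plus first- and second-order derivatives) clustered into 100 units with scikit-learn's MiniBatchKMeans, which the paper names. The librosa feature extraction, the 10 ms hop, and the file paths are assumptions:

```python
import librosa
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def mfcc_39(path):
    # 13 MFCCs (assumed 10 ms hop) plus first- and second-order derivatives.
    wav, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=13, hop_length=160)
    feats = np.concatenate(
        [mfcc, librosa.feature.delta(mfcc), librosa.feature.delta(mfcc, order=2)]
    )
    return feats.T  # (frames, 39)

files = ["librispeech/1.flac", "librispeech/2.flac"]  # hypothetical paths
feats = np.concatenate([mfcc_39(f) for f in files])

km = MiniBatchKMeans(n_clusters=100, batch_size=10000).fit(feats)
labels = [km.predict(mfcc_39(f)) for f in files]  # frame-level pseudo-labels
```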
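Finally, a sketch of the CTC fine-tuning setup: the convolutional waveform encoder is frozen and a character-level head over 26 letters + space + apostrophe + CTC blank (29 tokens) is trained with CTC. The attribute names, the `|` word-boundary token standing in for space, and the hidden size are assumptions:

```python
import torch.nn as nn

# 26 letters + word boundary ("|", standing in for space) + apostrophe + blank.
VOCAB = ["<blank>", "|", "'"] + [chr(c) for c in range(ord("a"), ord("z") + 1)]

def prepare_for_ctc(model, hidden_dim=768):
    # Freeze the CNN feature extractor; the Transformer and the new linear
    # head over the 29-token vocabulary are fine-tuned.
    for p in model.feature_extractor.parameters():  # assumed attribute name
        p.requires_grad = False
    model.ctc_head = nn.Linear(hidden_dim, len(VOCAB))
    return model

ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)
# usage: loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
# where log_probs has shape (time, batch, len(VOCAB)).
```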
#digest