Label Delay In Online Continual Learning

Manual:

This demo shows an instance of a realistic online learning problem. The stream of data points is collected from the webcam, and therefore it is under your full control how much the data-distribution is shifting. When you press one of the "Add Category" buttons, the newest sample will be associated with the corresponding label. There is no restriction what you can teach to your computer, everything is computed locally on your device, i.e. no data is sent to the server at all. In every step the data flows from the top of the Pending Entries to the bottom, and if the data point is labeled, it will be moved to the Memory Buffer. As you can see, there are a few slots in the Pending Entries, which means that the newest sample will not be available for training until the sample does not receive it's label. You can train the model by pressing the "Update Model" button, which will update the model parameters based on the selected samples.

The main purpose of this simulation is to highlight how label delay can affect the performance of the model. In the main paper, we propose the Importance Weighted Memory Sampling (IWMS) method, which is a compute-efficient technique that simply just reuses the feature embeddings (that were computed anyways during the forward pass) to emulate the distribution of the newest samples during the memory rehearsal. To show that our method truly brings benefits, we implemented a wide range of environment settings.

The Environment Settings

The sampling rate of the stream
The delay $d$ between observing the newest sample and receiving its label
The memory buffer size
The architecture of the model: 7 architectures, 2 pretrained
The optimizer (SGD, SGD w/ momentum, Adam)
The learning rate
The selection policy for the first and second sample
Whether to update features after each training iteration

The Prediction Card

The prediction card can be found on the top of the page. It consists of three blocks: the input image, the computed features and the prediction. The features are computed by the backbone architecture and are always projected to a 9-dimensional space. In the middle block you can find the feature embeddings arranged in a 3x3 grid, where the sizes of the circles represent the values of the embeddings. A little technical detail: the embeddings can be in the range of $[-\inf, \inf]$, and the size of the circle sizes are always normalized to the range of $[0, 1]$ for each data point. Finally, the actual prediction is shown in the rightmost block by the three circles representing the class probabilities. If you turn your webcam on, the newest sample will be the current frame and you will see how the embeddings are changing in real-time. It is interesting to see how the embeddings are changing when you are moving around, or when you are changing the lighting conditions. Different architectures respond quite differently to the same kind of visual changes, which can be also thought of as a characteristic of the architecture and of course the pretraining. For example, the linear model is very sensitive to every small change, while the MobileNetV2 features (which were pretrained on ImageNet) will mostl likely change very little when you change lighting conditions or rotate the object. Just like the feature embeddings, the prediction values are also updated in real-time, and the color of the largest circle is the argmax of the predicted probabilities.

The Datacards

If you click on the "Start Data Stream" button, the datacards will start to flow from the top to the bottom of the pending entries (with a default Categor 1 label), and straight away get added to the memory buffer. This is nicely animated in the beginning, but to avoid the moving parts, once the datacards fill up the edges of the grid, you will only see the content of the datacards being swapped out (instead of moving the cards themselves). When you change the Pending Entry or the Memory Buffer size, new datacards will be added or removed accordingly. Similarly to the prediction card, the datacards are also showing the feature embeddings in a 3x3 grid. The color of the background of the datacard indicates the category of the sample.

The Model

By default, the model is a linear model, randomly initialized with a fixed seed for reproducibility. Although a linear model might be too simple to learn generalizable features, you will be surprised how well it can perform when you have a small delay between the newest sample and the labeled data. To explore the performance of more complex models, you can select from the following architectures:

ResNet18
CNN Small
CNN Base
CNN Large
MobileNetV2 (ImageNet pretrained)
MobileNetV3 (ImageNet pretrained)

After the model selection you can start tuning the model parameters on the currently available memory samples, but be careful: the learned features are reset when the architecture is changed. When you click on the "Update Model" button, a green bar will start to fill up, indicating the cycles in which the model parameters are updated (set to 3 sec). In every iteration, the standard Cross Entropy objective is optimized on two samples: $X_1$ and $X_2$.

Selection Policies

The selection policy for the first and second sample can be set to one of the following:

Random: samples are selected randomly from the memory buffer
Newest: the newest sample is selected
Second Newest: the second newest sample is selected
Importance Weighted Memory: the nearest neighbor of the newest sample (in the feature space) is selected

The Importance Weighted Memory Sampling policy is a compute-efficient technique that simply just reuses the feature embeddings (that were computed anyways during the forward pass) to emulate the distribution of the newest samples during the memory rehearsal. This can have a significant impact on the performance of the model, especially when the newest sample is delayed for a long time.

The Similarity Grid

For each pending entries ($X_i$) and memory buffer ($X_j$) sample, we first compute the cosine similarity between their embeddings: $$\textrm{cos}\left(f_\theta(X_i), f_\theta(X_j)\right).$$ Second, we compute the softmax for every row, representing the probability of the IWM sampling policy selecting $X_j$: $$\textrm{softmax}_i = \frac{\exp(\textrm{cos}(X_i, X_j))}{\sum_{j=1}^N \exp(\textrm{cos}(X_i, X_j))}$$ For visualization in the grid, we rescale the probabilities such that the highest probability will fill the entire cell, and the colour will match the colour of the corresponding memory buffer sample.