Label Delay in Online Continual Learning

Botos Csaba¹², Wenxuan Zhang³, Matthias Müller², Ser-Nam Lim⁴, Mohamed Elhoseiny³, Philip Torr¹, Adel Bibi¹

¹University of Oxford, ²Intel, ³KAUST, ⁴Facebook AI Research, * Equal contribution

In NeurIPS 2024

[Paper] [Code] [Demo] [Video]

Our proposed Continual Learning setting considering Label Delay allows us to model a wide range of real-world applications where new raw data is revealed significantly sooner by the data stream $\mathcal{S}_{\mathcal{X}}$ than the annotation process $\mathcal{S}_{\mathcal{Y}}$ can provide the corresponding labels. The main objective is to maximize the accuracy on the newest Eval data using both the samples that have already received their label (in colour) and the more recent samples that are yet to be labeled (in gray).

Where does label delay come from?

In many real world scenario, the time between making predictions and the feedback can be vastly different due to the inherent nature of the task. Consider the following three examples: In medical applications, the predicted post-operation recovery time of the patient is one of the most important metrics, yet the official recovery time is only established during follow-up visits. In investment banking, the time it takes to receive the results of a trade can be significantly longer than the time it takes to execute the trade itself. In the world of copyright claims, an automated trigger mechanism can prevent fraudulent usage of the content sharing platform, however the actual evaluation of each case by the owners is often significantly delayed.

As one can see in the above examples, although the emergent problem of label delay is present across different applications, the root cause stems from entirely different sources. While the real world applications are heavily impacted by the phenomenon, the diversity of the various scenarios makes it difficult to find common patterns that can be used to address the problem of label delay. A few challenges that one might find when trying to model label delays in real-world applications are:

The data distribution is evolving over time
The delay factor cannot be influenced for analysis
The delay impacts the model in unknown ways

Our proposal

We propose a new Continual Learning setting, in which we show how does label delay impact the learning process. We consider the naïve solution of ignoring the most recently collected data and only using the samples that have already received their label and compare it to the ideal case where the labels are immediately available for all samples. We provide an extensive list of experiments (amounting to over 25K GPU hours) of trying to recover the performance of the ideal case by using the samples before their corresponding labels become available.

We use four large-scale datasets to evaluate our approach: Continual Localization (CLOC - 40M samples), Continual Google Landmarks (CGLM - 0.5M samples), Functional Map of the World (FMoW - 118K samples) and Yearbook (37K samples). As one can see in the above figures, there is a growing gap between the performance of the ideal case and the naïve solution as the delay increases. More importantly, we show that on different datasets the impact of the delay differs significantly, which highlights the importance of modeling label delay. In the figure below, we show how the performance of the ideal case (when the labels become immediately available) and the naïve solution changes as the delay increases under different computational budgets $\mathcal{C}$:

How to overcome label delay?

Even though one might not be able to influence the delay factor, we show that it is possible to recover the performance of the ideal case by using the samples before their corresponding labels become available. There are two main challenges that one needs to overcome in order to achieve this: 1) using the unlabeled samples to improve the model 2) keeping the solution computationally efficient To address these challenges our experiment allows the continual learning models to use the unlabeled samples, while normalizing the computational cost of the model to be the same as the naïve solution.

In our proposed label delay experimental setting, we show the larger the delay the more challenging it is for Naïve, a method that relies only on older labeled data, to effectively classify new samples. This is due to a larger gap in distribution between the samples used for training and for evaluation. This begs the question of whether the new unlabeled data can be used for training to improve over Naïve, as it is much more similar to the data that the model is evaluated on. We propose four different paradigms for utilizing the unlabeled data, namely, Importance Weighted Memory Sampling (IWMS), Semi-Supervised Learning via Pseudo-Labeling (PL), Self-Supervised Semi-Supervised Learning (S4L) and Test-Time Adaptation (TTA). We integrate several methods of each family into our setting and evaluate them under various delays and computational budgets. We show that our method, IWMS not only improves the performance of Naïve on three out of four datasets, but also outperforms the ideal case in two different computational budget and delay scenario:

To learn more about the details of our proposed method, please check out our live [Online Learning demo] which runs in the browser and uses the web-camera to model the real world data stream. This demo allows you to train and interact directly with the learned representation of the model, and see how the different sampling policies affect the performance of the model.

Future work

In this project, we have demonstrated the versatility of our proposed setting in modeling various label delay scenarios. A key presumption in our methodology is that the rate at which data is collected is identical to the rate at which labels are assigned. However, this assumption doesn't always hold true in practical situations. By allowing the rates of data collection and label assignment to be modeled independently, our method could be adapted for a broader array of applications where these two rates are not identical. Although our current model anticipates that each data sample will be assigned a label after a specific number of steps (exactly $d$ steps), this may not be feasible in real-world conditions where data accumulates faster than labels can be assigned, potentially leaving some samples unlabeled indefinitely. In such cases, the choice which samples are labeled and which are not is not arbitrary, but rather a strategic decision that can have a significant impact on the performance of the model. This is especially true in the case of continual learning, where the model is expected to perform well on the most recent data.

In the figure above, we implement the simplest possible strategy, where the samples are labeled in the order they are collected. We denote each annotator with their corresponding ID, therefore $\#1,\#2,\#3,...$ are the first, second, third annotators, respectively. In this example each annotator takes $d=4$ steps, and can only start labeling the next samples once they have finished labeling their assigned ones. If the ratio between the rate of data collection and the rate of label assignment is $r=1$, then after $d$ steps every sample will have received its label. However, if the ratio is $r>1$, then it means that the annotators cannot keep up with the rate of data collection and some samples will remain unlabeled indefinitely. In this case, the choice of which samples are labeled and which are not is a strategic decision that can have a significant impact on the performance of the model.

How to interact with the figure:

Data collection rate: controls how fast samples are revealed by the stream
Annotation rate: controls the annotation throughput