Label Delay in Online Continual Learning
Botos Csaba12*, Wenxuan Zhang3*, Matthias Müller2, Ser-Nam Lim4, Mohamed Elhoseiny3, Philip Torr1, Adel Bibi1
1University of Oxford, 2Intel, 3KAUST, 4Facebook AI Research, * Equal contribution
In NeurIPS 2024
[Paper] [Code] [Demo] [Video]
Our proposed Continual Learning setting with Label Delay allows us to model a wide range of real-world applications in which new raw data is revealed by the data stream $\mathcal{S}_{\mathcal{X}}$ significantly sooner than the annotation process $\mathcal{S}_{\mathcal{Y}}$ can provide the corresponding labels. The main objective is to maximize accuracy on the newest Eval data, using both the samples that have already received their labels (in colour) and the more recent samples that are yet to be labeled (in gray).
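To make the setting concrete, below is a minimal, illustrative sketch (in Python; the class name and interface are ours, not taken from the released code) of a stream with a fixed label delay $d$: at each time step the learner receives the newest unlabeled batch from $\mathcal{S}_{\mathcal{X}}$, while $\mathcal{S}_{\mathcal{Y}}$ only delivers the labels of the batch revealed $d$ steps earlier.

```python
class DelayedLabelStream:
    """Toy simulation of label delay: at time step t the stream reveals the
    newest unlabeled batch x_t, while the annotation process only delivers
    the labels of the batch revealed d steps earlier, i.e. (x_{t-d}, y_{t-d})."""

    def __init__(self, batches, labels, delay):
        assert len(batches) == len(labels)
        self.batches = batches  # raw data batches, in stream order (S_X)
        self.labels = labels    # corresponding label batches (S_Y)
        self.delay = delay      # label delay d, measured in time steps

    def __iter__(self):
        for t, x_t in enumerate(self.batches):
            if t >= self.delay:
                i = t - self.delay
                labeled = (self.batches[i], self.labels[i])  # newly labeled batch
            else:
                labeled = None  # no labels have arrived yet
            yield t, x_t, labeled
```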
Where does label delay come from?
In many real-world scenarios, the time between making a prediction and receiving the corresponding feedback varies widely due to the inherent nature of the task. Consider the following three examples. In medical applications, the predicted post-operation recovery time of a patient is one of the most important metrics, yet the actual recovery time is only established during follow-up visits. In investment banking, the time it takes to observe the outcome of a trade can be significantly longer than the time it takes to execute the trade itself. In copyright protection, an automated trigger mechanism can prevent fraudulent use of a content-sharing platform, yet the actual review of each claim by the rights owners is often significantly delayed.

These scenarios share three key properties:
- The data distribution is evolving over time.
- The delay factor cannot be influenced by the learner.
- The delay impacts the model in unknown ways.
Our proposal
We propose a new Continual Learning setting that explicitly accounts for label delay, and we show how the delay impacts the learning process.
We consider the naïve solution of ignoring the most recently collected, still-unlabeled data and training only on the samples that have already received their labels, and compare it to the ideal case in which labels are immediately available for all samples.
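The sketch below (hypothetical helper code, assuming PyTorch and the toy stream above) contrasts the two protocols: both learners are first evaluated on the newest batch and then updated, but the naïve baseline only trains on the delayed labeled batch, whereas the ideal learner receives the labels of the newest batch immediately.

```python
import torch

@torch.no_grad()
def accuracy(model, x, y):
    """Fraction of correct predictions on a batch."""
    return (model(x).argmax(dim=1) == y).float().mean().item()

def run_stream(model, optimizer, loss_fn, stream, ideal=False):
    """Evaluate-then-train loop: the naive baseline updates only on the
    delayed labeled batch; the ideal (zero-delay) learner also receives
    the labels of the newest batch immediately."""
    online_acc = []
    for t, x_t, labeled in stream:
        online_acc.append(accuracy(model, x_t, stream.labels[t]))  # evaluate before training on x_t
        if ideal:
            labeled = (x_t, stream.labels[t])  # pretend labels arrive instantly
        if labeled is not None:
            x, y = labeled
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()
            optimizer.step()
    return online_acc
```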
We provide an extensive set of experiments (amounting to over 25K GPU hours) attempting to recover the performance of the ideal case by using the samples before their corresponding labels become available.
We use four large-scale datasets to evaluate our approach: Continual Localization (CLOC - 40M samples), Continual Google Landmarks (CGLM - 0.5M samples), Functional Map of the World (FMoW - 118K samples) and Yearbook (37K samples).
As one can see in the above figures, there is a growing gap between the performance of the ideal case and the naïve solution as the delay increases.
More importantly, we show that on different datasets the impact of the delay differs significantly, which highlights the importance of modeling label delay.
In the figure below, we show how the performance of the ideal case (where labels are immediately available) and of the naïve solution changes as the delay increases under different computational budgets $\mathcal{C}$:
How to overcome label delay?
Even though one might not be able to influence the delay factor, we show that it is possible to recover the performance of the ideal case by using the samples before their corresponding labels become available. There are two main challenges to overcome in order to achieve this: 1) using the unlabeled samples to improve the model, and 2) keeping the solution computationally efficient. To address these challenges, our experimental protocol allows the continual learning models to use the unlabeled samples while normalizing their computational cost to match that of the naïve solution.
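As one illustration of how the unlabeled samples could be exploited under the same budget (a sketch under our own simplifying assumptions, not necessarily a method from the paper), the snippet below splits a fixed budget of $\mathcal{C}$ parameter updates per time step between supervised updates on the delayed labeled batch and self-training updates with confident pseudo-labels on the newest unlabeled batch.

```python
import torch

def budgeted_update_with_pseudo_labels(model, optimizer, loss_fn,
                                       labeled, unlabeled_x,
                                       budget_C, threshold=0.9):
    """Spend exactly budget_C updates per time step, alternating between
    supervised updates on the delayed labeled batch and self-training
    updates on the newest unlabeled batch (confident pseudo-labels only)."""
    for step in range(budget_C):
        if step % 2 == 0 and labeled is not None:
            x, y = labeled                        # supervised update on delayed labels
        else:
            with torch.no_grad():
                probs = model(unlabeled_x).softmax(dim=1)
                conf, y = probs.max(dim=1)        # model's own predictions as labels
                keep = conf > threshold           # keep only confident pseudo-labels
            if keep.sum() == 0:
                continue                          # no confident samples this step
            x, y = unlabeled_x[keep], y[keep]
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()
```

Because every method, including the naïve baseline, performs the same number of updates per time step, any accuracy gain comes from how the budget is spent rather than from extra computation.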
Future work

In this project, we have demonstrated the versatility of our proposed setting in modeling various label delay scenarios. A key assumption in our methodology is that the rate at which data is collected is identical to the rate at which labels are assigned. However, this assumption does not always hold in practice. By allowing the data collection and label assignment rates to be modeled independently, our setting could be adapted to a broader range of applications where the two rates differ. Although our current formulation assumes that each data sample receives its label after exactly $d$ steps, this may not be feasible in real-world conditions where data accumulates faster than labels can be assigned, potentially leaving some samples unlabeled indefinitely. In such cases, the choice of which samples are labeled and which are not is not arbitrary, but rather a strategic decision that can have a significant impact on the performance of the model. This is especially true in continual learning, where the model is expected to perform well on the most recent data.

How to interact with the figure:
- Data collection rate: controls how fast samples are revealed by the stream
- Annotation rate: controls the annotation throughput
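A rough sketch of what independent rates mean in practice (purely illustrative; the function name and numbers are ours): each time step the stream reveals `data_rate` new samples while the annotation process labels the `annotation_rate` oldest pending ones, so a lower annotation rate leaves a growing backlog of unlabeled samples.

```python
from collections import deque

def simulate_rates(num_steps, data_rate, annotation_rate):
    """Track the backlog of unlabeled samples when data collection and
    annotation proceed at independent rates."""
    next_id = 0
    pending = deque()                     # sample ids still waiting for a label
    backlog = []
    for _ in range(num_steps):
        for _ in range(data_rate):        # data collection: reveal new samples
            pending.append(next_id)
            next_id += 1
        for _ in range(annotation_rate):  # annotation: label the oldest pending samples
            if pending:
                pending.popleft()
        backlog.append(len(pending))
    return backlog

# e.g. simulate_rates(100, data_rate=3, annotation_rate=1) yields a backlog
# that grows by two samples per step.
```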