Towards Robust Evaluations of Continual Learning

Sebastian Farquhar, Yarin Gal
2019, Oct 21

Motivation

  • Recent research relates the problem of continual learning to increasing the difficulty of the datasets used.
  • Current experimental set-ups and evaluations are misleading because they are biased towards prior-focused approaches.
  • We need to encourage continual learning strategies that work when all the fundamental desiderata are enforced, not just a subset of them.

Contributions

  • Thorough analysis of flaws in existing evaluations used by our community.
  • Empirically show that current evaluations are biased towards prior-focused approaches.
  • Propose fundamental desiderata for future evaluations of continual learning strategies.
  • These desiderata can be applied irrespective of the dataset.
  • Propose new experimental set-ups to overcome the issues of existing ones.

Proposed Desiderata

- Core Desiderata

  • A: Cross-task resemblance: data from new tasks must resemble data from old tasks closely enough that, at least sometimes early in training, the model makes confident predictions for old classes. Permuted MNIST violates this property.
  • B: Shared output head: all tasks share a single output layer, rather than each task getting its own head (see the sketch after this list).
  • C: No test-time assumed task labels: the model is not told which task an input belongs to at test time.
  • D: No unconstrained re-training on old tasks: the strategy may not freely store and replay old data.
  • E: More than two tasks: the more tasks a continual learning strategy can deal with, the better.
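
To make desiderata B and C concrete, here is a minimal PyTorch sketch contrasting the two architectures. The layer sizes (784-256), task counts, and class names are illustrative assumptions, not details from the paper.

```python
import torch.nn as nn

class MultiHeadNet(nn.Module):
    """Violates desiderata B and C: one output head per task, and the
    task identity must be supplied at test time to select a head."""
    def __init__(self, n_tasks=5, classes_per_task=2):
        super().__init__()
        self.body = nn.Sequential(nn.Flatten(), nn.Linear(784, 256), nn.ReLU())
        self.heads = nn.ModuleList(
            nn.Linear(256, classes_per_task) for _ in range(n_tasks))

    def forward(self, x, task_id):  # task label required at test time
        return self.heads[task_id](self.body(x))

class SharedHeadNet(nn.Module):
    """Satisfies desiderata B and C: a single head over all classes is
    shared by every task, so no task label is needed at test time."""
    def __init__(self, n_classes=10):
        super().__init__()
        self.body = nn.Sequential(nn.Flatten(), nn.Linear(784, 256), nn.ReLU())
        self.head = nn.Linear(256, n_classes)

    def forward(self, x):  # no task label anywhere
        return self.head(self.body(x))
```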

- Other desiderata

  • Unclear task demarcation
  • Continuous tasks
  • Overlapping tasks
  • Long task sequences
  • Time/Compute/Memory constraints
  • Strict privacy guarantees

Critical analysis of existing evaluations

  • Permuted MNIST does not represent real-world continual learning scenarios because it violates desideratum A (the construction is sketched below).
  • The multi-headed version of split MNIST requires knowing the tasks, and the classes in each task, a priori.
  • Prior-focused approaches tend to perform much better in multi-headed versions than in single-headed versions.
  • Two-task transfer is not a realistic continual learning evaluation, as algorithms that pass it may still fail with more tasks.
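
As a concrete illustration of the first point, below is a minimal NumPy sketch of the standard Permuted MNIST construction, with random arrays standing in for the real MNIST images. Because each task applies its own fixed pixel permutation, inputs from different tasks share almost no visual structure, which is exactly the violation of desideratum A.

```python
import numpy as np

def make_permuted_task(images, seed):
    """One fixed random pixel permutation defines one task; the permutation
    scrambles all spatial structure, so data from different tasks barely
    resemble each other (violating desideratum A)."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(28 * 28)
    flat = images.reshape(len(images), -1)
    return flat[:, perm].reshape(images.shape)

rng = np.random.default_rng(0)
X = rng.random((100, 28, 28), dtype=np.float32)  # stand-in for MNIST images
tasks = [make_permuted_task(X, seed=t) for t in range(5)]  # five tasks
```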

Empirical analysis of existing evaluations

  • Experiments enforcing all five core desiderata show that prior-focused methods suffer the most.
  • Omitting any desideratum can lead to blind spots in the evaluation pipeline.
  • Model uncertainty can be used to detect task changes (see the sketch after this list).
  • Training time and accuracy must be traded off; memory can be treated analogously to time in this trade-off.
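
A toy sketch of the uncertainty idea: flag a task boundary when the mean predictive entropy of an incoming batch jumps. The threshold value here is a hypothetical choice for illustration, not the paper's actual mechanism.

```python
import numpy as np

def predictive_entropy(probs):
    """Entropy of the predictive distribution, one value per example."""
    return -np.sum(probs * np.log(probs + 1e-12), axis=-1)

def task_change_detected(probs_batch, threshold=1.5):
    """Flag a task boundary when mean predictive entropy exceeds a
    threshold: data from an unseen task should leave the model
    noticeably less certain than data from the task it has learned."""
    return predictive_entropy(probs_batch).mean() > threshold

# confident predictions (current task) vs near-uniform ones (new task)
confident = np.tile([0.9] + [0.1 / 9] * 9, (32, 1))  # entropy ~0.54
uniform = np.full((32, 10), 0.1)                      # entropy ~2.30
print(task_change_detected(confident))  # False
print(task_change_detected(uniform))    # True
```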

Insights and Conclusion

  • Current evaluations are misleading because they are biased towards prior-focused approaches.
  • Continual learning strategies should be evaluated with all five proposed core desiderata enforced; single-headed split MNIST is the simplest experimental set-up that satisfies them (sketched below).
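
For reference, a minimal NumPy sketch of the single-headed split MNIST protocol. Random arrays stand in for the real MNIST data, and the class pairs follow the usual Split MNIST convention.

```python
import numpy as np

def split_mnist_tasks(images, labels,
                      pairs=((0, 1), (2, 3), (4, 5), (6, 7), (8, 9))):
    """Five two-class tasks seen in sequence (desideratum E). Labels keep
    their original 0-9 values so that every task trains the same shared
    10-way head (desiderata B and C), and no task label accompanies the
    data at test time."""
    for pair in pairs:
        mask = np.isin(labels, pair)
        yield images[mask], labels[mask]

rng = np.random.default_rng(0)
X = rng.random((1000, 784), dtype=np.float32)  # stand-in for MNIST images
y = rng.integers(0, 10, size=1000)
for t, (Xt, yt) in enumerate(split_mnist_tasks(X, y)):
    print(f"task {t}: classes {sorted(set(yt.tolist()))}, {len(yt)} examples")
```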