Does the Adam Optimizer Amplify Catastrophic Forgetting? | HackerNoon

News Room
Published 17 March 2026 (last updated 17 March 2026 at 7:13 PM)

:::info
Authors:

  1. Dylan R. Ashley
  2. Sina Ghiassian
  3. Richard S. Sutton

:::

Table of Links

Abstract

1 Introduction

2 Related Work

3 Problem Formulation

4 Measuring Catastrophic Forgetting

5 Experimental Setup

6 Results

7 Discussion

8 Conclusion

9 Future Work and References

Abstract

Catastrophic forgetting remains a severe hindrance to the broad application of artificial neural networks (ANNs), yet it continues to be a poorly understood phenomenon. Despite the extensive amount of work on catastrophic forgetting, we argue that it is still unclear exactly how the phenomenon should be quantified and, moreover, to what degree all of the choices we make when designing learning systems affect the amount of catastrophic forgetting. We use various testbeds from the reinforcement learning and supervised learning literature to (1) provide evidence that the choice of modern gradient-based optimization algorithm used to train an ANN has a significant impact on the amount of catastrophic forgetting, showing that, surprisingly, in many instances classical algorithms such as vanilla SGD experience less catastrophic forgetting than more modern algorithms such as Adam; and (2) empirically compare four existing metrics for quantifying catastrophic forgetting, showing that the degree to which learning systems experience catastrophic forgetting is sufficiently sensitive to the metric used that a change from one principled metric to another is enough to change the conclusions of a study dramatically. Our results suggest that a much more rigorous experimental methodology is required when studying catastrophic forgetting. Based on our results, we recommend that inter-task forgetting in supervised learning be measured with both retention and relearning metrics concurrently, and that intra-task forgetting in reinforcement learning be measured, at the very least, with pairwise interference.

1 Introduction

In online learning, catastrophic forgetting refers to the tendency of artificial neural networks (ANNs) to forget previously learned information in the presence of new information (French, 1991, p. 173). Catastrophic forgetting presents a severe issue for the broad applicability of ANNs, as many important learning problems, such as reinforcement learning, are online learning problems. Efficient online learning is also core to the continual learning problem, sometimes called lifelong learning (Chen and Liu, 2018, p. 55). The existence of catastrophic forgetting is of particular relevance now, as ANNs have been responsible for a number of major artificial intelligence (AI) successes in recent years (e.g., Taigman et al. (2014), Mnih et al. (2015), Silver et al. (2016), Gatys et al. (2016), Vaswani et al. (2017), Radford et al. (2019), Senior et al. (2020)). Thus, there is reason to believe that methods able to successfully mitigate catastrophic forgetting could lead to new breakthroughs in online learning problems.

The significance of the catastrophic forgetting problem means that it has attracted much attention from the AI community. It was first formally reported in McCloskey and Cohen (1989) and, since then, numerous methods have been proposed to mitigate it (e.g., Kirkpatrick et al. (2017), Lee et al. (2017), Zenke et al. (2017), Masse et al. (2018), Sodhani et al. (2020)). Despite this, it continues to be an unsolved issue (Kemker et al., 2018). This may be partly because the phenomenon itself, and what contributes to it, is poorly understood, with recent work still uncovering fundamental connections (e.g., Mirzadeh et al. (2020)). This paper is offered as a step forward in our understanding of catastrophic forgetting. We revisit the fundamental questions of (1) how catastrophic forgetting should be quantified, and (2) to what degree all of the choices we make when designing learning systems affect the amount of catastrophic forgetting. To answer the first question, we compare several existing measures of catastrophic forgetting: retention, relearning, activation overlap, and pairwise interference. We discuss each of these metrics in detail in Section 4. We show that, despite each of these metrics providing a principled measure of catastrophic forgetting, the relative ranking of algorithms varies wildly between them. This result suggests that catastrophic forgetting is not a phenomenon that any single one of these metrics can effectively describe. As most existing research into methods to mitigate catastrophic forgetting rarely looks at more than one of these metrics, our results imply that a more rigorous experimental methodology is required in the research community. Based on our results, we recommend that work looking at inter-task forgetting in supervised learning should, at the very least, consider both retention and relearning metrics concurrently. For intra-task forgetting in reinforcement learning, our results suggest that pairwise interference may be a suitable metric, but that activation overlap should, in general, be avoided as a singular measure of catastrophic forgetting.
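The four metrics are defined precisely in Section 4 of the paper. As a rough, self-contained illustration only (the function names and toy numbers below are ours, not the paper's exact definitions), a retention-style score and a gradient-alignment notion of pairwise interference can be sketched as:

```python
# Toy sketch of two catastrophic-forgetting measures. These are simplified
# stand-ins, NOT the paper's exact definitions (see Section 4 for those).

def retention(acc_after_task_a, acc_after_task_b):
    """Fraction of task-A performance retained after training on task B."""
    return acc_after_task_b / acc_after_task_a

def pairwise_interference(grad_i, grad_j):
    """Dot product of two per-example gradients: a negative value suggests
    that a step taken for example i undoes progress on example j."""
    return sum(gi * gj for gi, gj in zip(grad_i, grad_j))

# Example: task-A accuracy drops from 0.90 to 0.45 after training on task B,
# so half of the task-A performance was retained.
print(retention(0.90, 0.45))                             # 0.5

# Two examples whose gradients point in opposing directions interfere.
print(pairwise_interference([1.0, -2.0], [-0.5, 1.0]))   # -2.5
```

Measures of this flavour probe different things (retained performance versus per-update conflict), which is one reason the rankings they induce can disagree.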

To address the question of to what degree all the choices we make when designing learning systems affect the amount of catastrophic forgetting, we look at how the choice of modern gradient-based optimizer used to train an ANN impacts the amount of catastrophic forgetting that occurs during training. We empirically compare vanilla SGD, SGD with Momentum (Qian, 1999; Rumelhart et al., 1986), RMSProp (Hinton et al., n.d.), and Adam (Kingma and Ba, 2014) under the different metrics and testbeds. Our results suggest that selecting one of these optimizers over another does indeed result in a significant change in the catastrophic forgetting experienced by the learning system. Furthermore, our results ground previous observations about why vanilla SGD is often favoured in continual learning settings (Mirzadeh et al., 2020, p. 6): namely, that it frequently experiences less catastrophic forgetting than the more sophisticated gradient-based optimizers, with a particularly pronounced reduction when compared with Adam. To the best of our knowledge, this is the first work explicitly providing strong evidence of this. Importantly, in this work we are trying to better understand the phenomenon of catastrophic forgetting itself, not explicitly seeking to understand the relationship between catastrophic forgetting and performance. While that relationship is important, it is not the focus of this work, so we defer all discussion of it to Appendix C of our supplementary material. The source code for our experiments is available at https://github.com/dylanashley/catastrophic-forgetting/tree/arxiv.
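The comparison protocol itself is simple to state: train on one task, switch to another, and report how much the first task's loss rises. The toy sketch below illustrates only that protocol; it is our invention, not the paper's experimental setup (the quadratic "tasks", step counts, and hyperparameters are made up, and a single scalar parameter cannot reproduce the paper's findings about which optimizer forgets more):

```python
# Toy illustration of the train-A-then-B forgetting protocol with two
# optimizers. NOT the paper's setup: the quadratic tasks, 100-step budget,
# and hyperparameters are invented for illustration only.

def sgd_step(w, g, state, lr=0.1):
    return w - lr * g                      # plain gradient descent

def adam_step(w, g, state, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    state["t"] += 1
    state["m"] = b1 * state["m"] + (1 - b1) * g          # first moment
    state["v"] = b2 * state["v"] + (1 - b2) * g * g      # second moment
    m_hat = state["m"] / (1 - b1 ** state["t"])          # bias correction
    v_hat = state["v"] / (1 - b2 ** state["t"])
    return w - lr * m_hat / (v_hat ** 0.5 + eps)

def loss(w, target):                       # a "task" = reach a target value
    return (w - target) ** 2

def grad(w, target):
    return 2 * (w - target)

def run(step_fn):
    w, state = 0.0, {"m": 0.0, "v": 0.0, "t": 0}
    for _ in range(100):                   # task A: target +1.0
        w = step_fn(w, grad(w, 1.0), state)
    loss_a_before = loss(w, 1.0)
    for _ in range(100):                   # task B: target -1.0
        w = step_fn(w, grad(w, -1.0), state)
    return loss(w, 1.0) - loss_a_before    # rise in task-A loss = forgetting

print("SGD forgetting: ", run(sgd_step))
print("Adam forgetting:", run(adam_step))
```

Note that the optimizer state (Adam's moment estimates) is deliberately carried across the task boundary, as it would be in an online setting; how that state interacts with the task switch is part of what distinguishes the optimizers.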

:::info
This paper is available on arxiv under CC by 4.0 Deed (Attribution 4.0 International) license.

:::
