Study Finds Optimizer Choice Significantly Impacts Model Retention

TABLE OF LINKS

Abstract

1 Introduction

2 Related Work

3 Problem Formulation

4 Measuring Catastrophic Forgetting

5 Experimental Setup

6 Results

7 Discussion

8 Conclusion

9 Future Work and References

2 Related Work

This section connects several closely related works to our own and examines how our work complements them. The first of these, Kemker et al. (2018), directly observed how different datasets and different metrics changed the effectiveness of contemporary algorithms designed to mitigate catastrophic forgetting. Our work extends their conclusions to non-retention-based metrics and to more closely related algorithms. Hetherington and Seidenberg (1989) demonstrated that the severity of the catastrophic forgetting shown in the experiments of McCloskey and Cohen (1989) was reduced if catastrophic forgetting was measured with relearning-based rather than retention-based metrics. Our work extends their ideas to more families of metrics and a more modern experimental setting. Goodfellow et al. (2013) looked at how different activation functions affected catastrophic forgetting and whether dropout could be used to reduce its severity. Our work extends theirs to the choice of optimizer and the metric used to quantify catastrophic forgetting.

While we provide the first formal comparison of modern gradient-based optimizers with respect to the amount of catastrophic forgetting they experience, others have previously hypothesized that such a relation could exist. Ratcliff (1990) contemplated the effect of momentum on their classic results around catastrophic forgetting and then briefly experimented to confirm that their conclusions applied under both SGD and SGD with Momentum. While they observed only small differences, our work demonstrates that a more thorough experiment reveals a much more pronounced effect of the optimizer on the degree of catastrophic forgetting. Furthermore, our comparison includes the more modern gradient-based optimizers RMSProp and Adam, which, as noted by Mirzadeh et al. (2020, p. 6), are oddly absent from many contemporary learning systems designed to mitigate catastrophic forgetting.

3 Problem Formulation

In this section, we define the two problem formulations we consider in this work: online supervised learning, and online state value estimation in undiscounted, episodic reinforcement learning.

The supervised learning task is to learn a mapping $f : \mathbb{R}^n \to \mathbb{R}$ from a set of examples $(x_0, y_0), (x_1, y_1), \ldots, (x_n, y_n)$. The supervised learning framework is a general one, as each $x_i$ could be anything from an image to the full text of a book, and each $y_i$ could be anything from the name of an animal to the average amount of time needed to read something. In the incremental online variant of supervised learning, each example $(x_t, y_t)$ only becomes available to the learning system at time $t$, and the learning system is expected to learn from only this example at time $t$.

Reinforcement learning considers an agent interacting with an environment. Often this is formulated as a Markov Decision Process, where, at each time step $t$, the agent observes the current state of the environment $S_t \in \mathcal{S}$, takes an action $A_t \in \mathcal{A}$, and, for having taken action $A_t$ when the environment is in state $S_t$, subsequently receives a reward $R_{t+1} \in \mathbb{R}$. In episodic reinforcement learning, this continues until the agent reaches a terminal state $S_T \in \mathcal{T} \subset \mathcal{S}$. In undiscounted policy evaluation in reinforcement learning, the goal is to learn, for each state, the expected sum of rewards received before the episode terminates when following a given policy (Sutton and Barto, 2018, p. 74). Formally,

$$v_\pi(s) = \mathbb{E}_\pi\left[\sum_{t=1}^{T} R_t \,\middle|\, S_0 = s\right]$$

where $\pi$ is the policy mapping states to actions, and $T$ is the number of steps left in the episode. We refer to $v_\pi(s)$ as the value of state $s$ under policy $\pi$. In the incremental online variant of value estimation in undiscounted episodic reinforcement learning, each transition $(S_{t-1}, R_t, S_t)$ only becomes available to the learning system at time $t$, and the learning system is expected to learn from only this transition at time $t$.
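To make the two incremental online settings concrete, the following is a minimal Python sketch and not the experimental setup used in this work. It assumes a linear model trained with plain SGD for the supervised case, and tabular TD(0) on a small random walk for the value estimation case. The function names, the random-walk environment, and the step sizes are all hypothetical choices made for illustration. In both loops, the learner sees each example or transition exactly once, at the time it arrives.

```python
import random


def online_sgd_regression(stream, n_features, step_size=0.05):
    """Incremental online supervised learning with a linear model (assumed).

    Each (x_t, y_t) is seen exactly once, at time t, and is then discarded.
    """
    weights = [0.0] * n_features
    for x_t, y_t in stream:
        prediction = sum(w * x for w, x in zip(weights, x_t))
        error = y_t - prediction
        # One gradient step on the squared error of this single example.
        weights = [w + step_size * error * x for w, x in zip(weights, x_t)]
    return weights


def online_td0_random_walk(n_states=5, episodes=2000, step_size=0.05, seed=0):
    """Incremental online TD(0) value estimation, undiscounted and episodic.

    Hypothetical environment: a random walk over states 0..n_states-1 under
    a fixed policy that moves left or right with equal probability. Stepping
    off the right end gives reward 1; stepping off the left end gives 0.
    """
    rng = random.Random(seed)
    values = [0.0] * n_states
    for _ in range(episodes):
        state = n_states // 2  # every episode starts in the middle state
        while True:
            next_state = state + rng.choice((-1, 1))
            terminal = next_state < 0 or next_state >= n_states
            reward = 1.0 if next_state >= n_states else 0.0
            # Terminal states have value 0; there is no discount factor.
            next_value = 0.0 if terminal else values[next_state]
            # Learn from only this transition (S_{t-1}, R_t, S_t).
            values[state] += step_size * (reward + next_value - values[state])
            if terminal:
                break
            state = next_state
    return values
```

For this particular five-state random walk, the true values under the equiprobable policy are 1/6, 2/6, ..., 5/6, so the returned estimates can be checked against them. Replacing the plain SGD-style update with a different optimizer is the kind of change this work goes on to compare.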

:::info
Authors: