Authors:
(1) Jarrod Haas, SARlab, Department of Engineering Science, Simon Fraser University; Digitalist Group Canada and [email protected];
(2) William Yolland, MetaOptima and [email protected];
(3) Bernhard Rabus, SARlab, Department of Engineering Science, Simon Fraser University and [email protected].
- Abstract and 1 Introduction
- 2 Background
- 2.1 Problem Definition
- 2.2 Related Work
- 2.3 Deep Deterministic Uncertainty
- 2.4 L2 Normalization of Feature Space and Neural Collapse
- 3 Methodology
- 3.1 Models and Loss Functions
- 3.2 Measuring Neural Collapse
- 4 Experiments
- 4.1 Faster and More Robust OoD Results
- 4.2 Linking Neural Collapse with OoD Detection
- 5 Conclusion and Future Work, and References
- A Appendix
- A.1 Training Details
- A.2 Effect of L2 Normalization on Softmax Scores for OoD Detection
- A.3 Fitting GMMs on Logit Space
- A.4 Overtraining with L2 Normalization
- A.5 Neural Collapse Measurements for NC Loss Intervention
- A.6 Additional Figures
Abstract
We propose a simple modification to standard ResNet architectures, L2 normalization over feature space, that substantially improves out-of-distribution (OoD) performance on the previously proposed Deep Deterministic Uncertainty (DDU) benchmark. We show that this change also induces early Neural Collapse (NC), an effect linked to better OoD performance. Our method achieves comparable or superior OoD detection scores and classification accuracy in a small fraction of the training time of the benchmark. Additionally, it substantially improves worst-case OoD performance over multiple, randomly initialized models. Though we do not suggest that NC is the sole mechanism or a comprehensive explanation for OoD behaviour in deep neural networks (DNNs), we believe NC’s simple mathematical and geometric structure can provide a framework for analysis of this complex phenomenon in future work.
1 Introduction
It is well known that Deep Neural Networks (DNNs) lack robustness to distribution shift and may not reliably indicate failure when receiving out-of-distribution (OoD) inputs (Rabanser et al., 2018; Chen et al., 2020). Specifically, networks may give confident predictions in cases where inputs are completely irrelevant; for example, an image of a plane fed to a network trained to classify dogs and cats may produce high confidence scores for either class. This inability of networks to “know what they do not know” hinders the application of machine learning in engineering and other safety-critical domains (Henne et al., 2020).
A number of recent developments have attempted to address this problem, the most widely used being Monte Carlo Dropout (MCD) and ensembles (Gal and Ghahramani, 2016; Lakshminarayanan et al., 2017). While supported by a reasonable theoretical foundation, MCD underperforms in some applications and requires multiple forward passes of the model after training (Haas and Rabus, 2021; Ovadia et al., 2019). Ensembles can provide better accuracy than MCD, as well as better OoD detection under larger distribution shifts, but require a substantial increase in compute (Ovadia et al., 2019).
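To make MCD's inference cost concrete, the following is a minimal PyTorch sketch of predicting with dropout left active over several stochastic forward passes; the helper name and sample count are illustrative assumptions, and it presumes the model contains `nn.Dropout` layers.

```python
import torch

def mc_dropout_predict(model, x, num_samples=20):
    """Average softmax probabilities over several stochastic forward passes.

    Sketch only: assumes `model` contains nn.Dropout layers; the function
    name and number of samples are illustrative.
    """
    model.eval()
    for m in model.modules():
        if isinstance(m, torch.nn.Dropout):
            m.train()  # keep dropout active at test time
    with torch.no_grad():
        probs = torch.stack(
            [torch.softmax(model(x), dim=1) for _ in range(num_samples)]
        )
    return probs.mean(dim=0), probs.var(dim=0)  # predictive mean and variance
```

Each prediction thus costs `num_samples` forward passes, which is the overhead that single-pass deterministic methods aim to avoid.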
These limitations have spurred interest in deterministic and single forward pass methods. Notable amongst these is Deep Deterministic Uncertainty (DDU) (Mukhoti et al., 2021). DDU is much simpler than many competing approaches (Liu et al., 2020; Van Amersfoort et al., 2020; van Amersfoort et al., 2021), produces competitive results and has been proposed as a benchmark for uncertainty methods. A limitation, as shown in our experiments, is that DDU requires long training times and produces models with inconsistent performance.
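For context, the core of DDU-style OoD scoring is a density estimate fitted on the feature space of a trained model. The sketch below illustrates that idea with one Gaussian per class and a log-marginal-density score; it is an approximation of the general approach, not the benchmark's exact implementation.

```python
import torch

def fit_class_gaussians(features, labels, num_classes, jitter=1e-4):
    """Fit one Gaussian per class on in-distribution feature vectors (sketch)."""
    dists, priors = [], []
    for c in range(num_classes):
        z = features[labels == c]
        mean = z.mean(dim=0)
        cov = torch.cov(z.T) + jitter * torch.eye(z.shape[1])  # regularized covariance
        dists.append(torch.distributions.MultivariateNormal(mean, covariance_matrix=cov))
        priors.append(z.shape[0] / features.shape[0])
    return dists, torch.tensor(priors)

def feature_density_score(features, dists, priors):
    """Log marginal density under the class-conditional mixture; low density suggests OoD."""
    log_probs = torch.stack([d.log_prob(features) for d in dists], dim=1)
    return torch.logsumexp(log_probs + priors.log(), dim=1)
```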
We demonstrate that DDU can be substantially improved via L2 normalization over feature space in standard ResNet architectures. Beyond offering performance gains in accuracy and OoD detection, L2 normalization induces neural collapse (NC) much earlier than standard training. NC was recently found to occur in many NN architectures when they are overtrained (Papyan et al., 2020). This may provide a way to render the complexity of deep neural networks more tractable, such that they can be analyzed through the relative geometric and mathematical simplicity of simplex Equiangular Tight Frames (simplex ETF) (Mixon et al., 2022; Zhu et al., 2021; Lu and Steinerberger, 2020; Ji et al., 2021). Although this simplex ETF is limited to the feature layer and decision classifier, these layers summarize a substantial amount of network functionality. While Papyan et al. demonstrate increased adversarial robustness under NC, to the best of our knowledge, we present the first study of the relationship between OoD detection and NC.
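Concretely, the modification amounts to normalizing the penultimate (feature-layer) activations to unit L2 norm before the final linear classifier. The PyTorch sketch below shows one way to do this; the wrapper class, backbone choice, and input sizes are illustrative assumptions rather than the authors' exact training code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18

class L2NormResNet(nn.Module):
    """ResNet whose penultimate features are L2-normalized before the classifier (sketch)."""

    def __init__(self, num_classes=10):
        super().__init__()
        backbone = resnet18(weights=None)
        self.feature_dim = backbone.fc.in_features
        backbone.fc = nn.Identity()           # expose the feature (penultimate) layer
        self.backbone = backbone
        self.fc = nn.Linear(self.feature_dim, num_classes)

    def forward(self, x):
        z = self.backbone(x)                  # features, shape (batch, feature_dim)
        z = F.normalize(z, p=2, dim=1)        # L2 normalization over feature space
        return self.fc(z)                     # logits for standard cross-entropy training

model = L2NormResNet(num_classes=10)
logits = model(torch.randn(4, 3, 32, 32))     # e.g. CIFAR-sized inputs
print(logits.shape)                           # torch.Size([4, 10])
```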
We summarize our contributions as follows:
1) L2 normalization over the feature space of deep learning models results in OoD detection and classification performance that is competitive with or exceeds that of the DDU benchmark. Most notably, the worst-case OoD detection performance across model seeds is substantially improved.
2) Models trained with L2 normalization over feature space produce the aforementioned performance benefits in 17% (ResNet18) to 29% (ResNet50) of the training time of the DDU benchmark. Our proposed L2 normalization adds no significant training time relative to models without it.
3) L2 normalization over feature space induces NC as much as five times faster than standard training. Controlling the rate of NC may be useful for analyzing DNN behaviour (see the measurement sketch after this list).
4) NC is linked with OoD detection under our proposed modification to the DDU method. We show evidence that fast NC plays a role in achieving OoD detection performance with less training, and that training directly on NC has a substantially different effect on OoD performance than standard cross entropy (CE) training. This connection between simplex ETFs that naturally arise in DNNs and OoD performance permits an elegant analytical framework for further study of the underlying mechanisms that govern uncertainty and robustness in DNNs.
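As referenced in contribution 3, the degree of neural collapse can be quantified. The sketch below computes an NC1-style within-class variability metric in the spirit of Papyan et al. (2020); the exact metrics used in our methodology may differ, and the function name is illustrative.

```python
import torch

def nc1_within_class_variability(features, labels, num_classes):
    """NC1-style metric: tr(Sigma_W @ pinv(Sigma_B)) / num_classes.

    Sigma_W is the within-class covariance of feature vectors and Sigma_B the
    between-class covariance of class means; the value shrinks toward zero as
    within-class variability collapses.
    """
    d = features.shape[1]
    global_mean = features.mean(dim=0)
    sigma_w = torch.zeros(d, d, dtype=features.dtype)
    sigma_b = torch.zeros(d, d, dtype=features.dtype)
    for c in range(num_classes):
        z = features[labels == c]
        mu_c = z.mean(dim=0)
        centered = z - mu_c
        sigma_w += centered.T @ centered / features.shape[0]   # within-class scatter
        diff = (mu_c - global_mean).unsqueeze(1)
        sigma_b += diff @ diff.T / num_classes                  # between-class scatter
    return torch.trace(sigma_w @ torch.linalg.pinv(sigma_b)) / num_classes
```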