Authors:
(1) Jarrod Haas, SARlab, Department of Engineering Science, Simon Fraser University; Digitalist Group Canada and [email protected];
(2) William Yolland, MetaOptima and [email protected];
(3) Bernhard Rabus, SARlab, Department of Engineering Science, Simon Fraser University and [email protected].
- Abstract and 1 Introduction
- 2 Background
- 2.1 Problem Definition
- 2.2 Related Work
- 2.3 Deep Deterministic Uncertainty
- 2.4 L2 Normalization of Feature Space and Neural Collapse
- 3 Methodology
- 3.1 Models and Loss Functions
- 3.2 Measuring Neural Collapse
- 4 Experiments
- 4.1 Faster and More Robust OoD Results
- 4.2 Linking Neural Collapse with OoD Detection
- 5 Conclusion and Future Work, and References
- A Appendix
- A.1 Training Details
- A.2 Effect of L2 Normalization on Softmax Scores for OoD Detection
- A.3 Fitting GMMs on Logit Space
- A.4 Overtraining with L2 Normalization
- A.5 Neural Collapse Measurements for NC Loss Intervention
- A.6 Additional Figures
5 Conclusion and Future Work
We propose a simple, one-line-of-code modification to the Deep Deterministic Uncertainty benchmark that provides superior OoD detection and classification accuracy in a fraction of the training time. We also establish that L2 normalization induces NC faster than regular training, and that NC is linked to OoD detection performance under the DDU method. Although we do not suggest that NC is the sole explanation for OoD performance, we expect that its simple structure can provide insight into the complex and poorly understood behaviour of uncertainty in deep neural networks. We believe this connection is a compelling direction for future research into uncertainty and robustness in DNNs.
References
Jiefeng Chen, Yixuan Li, Xi Wu, Yingyu Liang, and Somesh Jha. Robust out-of-distribution detection in neural networks. CoRR, abs/2003.09711, 2020. URL https://arxiv.org/abs/2003.09711.
Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In international conference on machine learning, pages 1050–1059. PMLR, 2016.
Jarrod Haas and Bernhard Rabus. Uncertainty estimation for deep learning-based segmentation of roads in synthetic aperture radar imagery. Remote Sensing, 13(8), 2021. ISSN 2072-4292. doi: 10.3390/rs13081472. URL https://www.mdpi.com/2072-4292/13/8/1472.
Matthias Hein, Maksym Andriushchenko, and Julian Bitterwolf. Why relu networks yield high-confidence predictions far away from the training data and how to mitigate the problem. CoRR, abs/1812.05720, 2018. URL http://arxiv.org/abs/1812.05720.
Dan Hendrycks and Kevin Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. CoRR, abs/1610.02136, 2016. URL http://arxiv.org/abs/1610.02136.
Maximilian Henne, Adrian Schwaiger, Karsten Roscher, and Gereon Weiss. Benchmarking uncertainty estimation methods for deep learning with safety-related metrics. In SafeAI@AAAI, pages 83–90, 2020.
Yen-Chang Hsu, Yilin Shen, Hongxia Jin, and Zsolt Kira. Generalized ODIN: detecting out-of-distribution image without learning from out-of-distribution data. CoRR, abs/2002.11297, 2020. URL https://arxiv.org/abs/2002.11297.
Wenlong Ji, Yiping Lu, Yiliang Zhang, Zhun Deng, and Weijie J Su. An unconstrained layer-peeled perspective on neural collapse. arXiv preprint arXiv:2110.02796, 2021.
Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. Advances in neural information processing systems, 30, 2017.
Kimin Lee, Kibok Lee, Honglak Lee, and Jinwoo Shin. A simple unified framework for detecting out-of-distribution samples and adversarial attacks. Advances in neural information processing systems, 31, 2018.
Shiyu Liang, Yixuan Li, and R. Srikant. Principled detection of out-of-distribution examples in neural networks. CoRR, abs/1706.02690, 2017. URL http://arxiv.org/abs/1706.02690.
Jeremiah Liu, Zi Lin, Shreyas Padhy, Dustin Tran, Tania Bedrax Weiss, and Balaji Lakshminarayanan. Simple and principled uncertainty estimation with deterministic deep learning via distance awareness. Advances in Neural Information Processing Systems, 33:7498–7512, 2020.
Weiyang Liu, Yandong Wen, Zhiding Yu, Ming Li, Bhiksha Raj, and Le Song. Sphereface: Deep hypersphere embedding for face recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 212–220, 2017.
Jianfeng Lu and Stefan Steinerberger. Neural collapse with cross-entropy loss. arXiv preprint arXiv:2012.08465, 2020.
Dustin G Mixon, Hans Parshall, and Jianzong Pi. Neural collapse with unconstrained features. Sampling Theory, Signal Processing, and Data Analysis, 20(2):1–13, 2022.
Jishnu Mukhoti, Andreas Kirsch, Joost van Amersfoort, Philip H. S. Torr, and Yarin Gal. Deterministic neural networks with appropriate inductive biases capture epistemic and aleatoric uncertainty. CoRR, abs/2102.11582, 2021. URL https://arxiv.org/abs/2102.11582.
Yaniv Ovadia, Emily Fertig, Jie Ren, Zachary Nado, David Sculley, Sebastian Nowozin, Joshua Dillon, Balaji Lakshminarayanan, and Jasper Snoek. Can you trust your model’s uncertainty? evaluating predictive uncertainty under dataset shift. Advances in neural information processing systems, 32, 2019.
Vardan Papyan, XY Han, and David L Donoho. Prevalence of neural collapse during the terminal phase of deep learning training. Proceedings of the National Academy of Sciences, 117(40):24652–24663, 2020.
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019. URL http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf.
Stephan Rabanser, Stephan Günnemann, and Zachary C. Lipton. Failing loudly: An empirical study of methods for detecting dataset shift, 2018. URL https://arxiv.org/abs/1810.11953.
Alexandre Sablayrolles, Matthijs Douze, Cordelia Schmid, and Hervé Jégou. Spreading vectors for similarity search. arXiv preprint arXiv:1806.03198, 2018.
Lewis Smith, Joost van Amersfoort, Haiwen Huang, Stephen J. Roberts, and Yarin Gal. Can convolutional resnets approximately preserve input distances? A frequency analysis perspective. CoRR, abs/2106.02469, 2021. URL https://arxiv.org/abs/2106.02469.
Joost Van Amersfoort, Lewis Smith, Yee Whye Teh, and Yarin Gal. Uncertainty estimation using a single deep deterministic neural network. In International conference on machine learning, pages 9690–9700. PMLR, 2020.
Joost van Amersfoort, Lewis Smith, Andrew Jesson, Oscar Key, and Yarin Gal. On feature collapse and deep kernel learning for single forward pass uncertainty. arXiv preprint arXiv:2102.11409, 2021.
Jian Zheng, Jingyi Li, Cong Liu, Jianfeng Wang, Jiang Li, and Hongling Liu. Anomaly detection for high-dimensional space using deep hypersphere fused with probability approach. Complex & Intelligent Systems, pages 1–16, 2022.
Chang Zhou, Lai Man Po, and Weifeng Ou. Angular deep supervised vector quantization for image retrieval. IEEE Transactions on Neural Networks and Learning Systems, 2020.
Zhihui Zhu, Tianyu Ding, Jinxin Zhou, Xiao Li, Chong You, Jeremias Sulam, and Qing Qu. A geometric analysis of neural collapse with unconstrained features. Advances in Neural Information Processing Systems, 34:29820–29834, 2021.
A Appendix
A.1 Training Details
All models (except those explicitly noted in the ablation study) use spectral normalization, leaky ReLUs and Global Average Pooling (GAP), as these produce the strongest baselines. Each experiment was conducted with fifteen randomly initialized model parameter sets; no fixed seeds were used at any time for initialization. We set the batch size to 1024 for all training runs, except the NC intervention models, which were more stable when training with a batch size of 2048. All training was conducted on four NVIDIA V100 GPUs in PyTorch 1.10.1 (Paszke et al., 2019).
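As a minimal sketch (not the exact training code), the architectural modifications above can be applied to a standard torchvision-style ResNet as follows; the helper name apply_sn_and_leaky_relu and the default negative slope are illustrative assumptions, and torchvision ResNets already apply GAP before the final linear layer.

```python
import torch.nn as nn
from torch.nn.utils import spectral_norm

def apply_sn_and_leaky_relu(model: nn.Module, negative_slope: float = 0.01) -> nn.Module:
    """Wrap every Conv2d with spectral normalization and swap ReLU for LeakyReLU."""
    for name, module in model.named_children():
        if isinstance(module, nn.Conv2d):
            # spectral_norm registers hooks on the module and returns it.
            setattr(model, name, spectral_norm(module))
        elif isinstance(module, nn.ReLU):
            setattr(model, name, nn.LeakyReLU(negative_slope, inplace=True))
        else:
            apply_sn_and_leaky_relu(module, negative_slope)
    return model
```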
Stochastic gradient descent (SGD) with an initial learning rate of 1e-1 was used as the optimizer for all experiments. Following the DDU benchmark, the learning rate was decreased by one order of magnitude at epochs 150 and 250 for the 350-epoch models; for the 100-epoch ResNet50 models the drops occur at epochs 75 and 90, and for the 60-epoch ResNet18 models at epochs 40 and 50. Models were trained on the standard CIFAR-10 training set, with a validation split of 10% created using a fixed random seed.
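A minimal sketch of the 350-epoch schedule described above is given below; the momentum and weight decay values are common defaults that are not specified in this appendix, and model, train_loader, and train_one_epoch are assumed to be defined elsewhere.

```python
import torch

# 350-epoch DDU-style schedule; the shorter runs use milestones 75/90 or 40/50.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-1,
                            momentum=0.9, weight_decay=5e-4)  # momentum/decay: assumed defaults
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[150, 250], gamma=0.1)

for epoch in range(350):
    train_one_epoch(model, train_loader, optimizer)  # user-supplied training loop
    scheduler.step()
```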
A.2 Effect of L2 Normalization on Softmax Scores for OoD Detection
Table 4: OoD detection results using (a) log probabilities from a GMM fitted over feature space and (b) softmax scores. ResNet18 and ResNet50 models were used, 15 seeds per experiment, trained on CIFAR10, with SVHN, CIFAR100 and Tiny ImageNet test sets used as OoD data. For all models, we indicate whether L2 normalization over feature space was used (L2/No L2) and how many training epochs occurred (60/100/350), and compare against baseline (No L2 350). There is no clear pattern of behaviour when using softmax scores for OoD detection, but using GMMs provides superior results.
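For reference, the following is a minimal sketch of fitting a class-conditional Gaussian mixture over feature space and scoring inputs by its log-density, in the spirit of the procedure behind (a); the function names, the covariance regularizer reg, and the input arrays feats and labels are illustrative assumptions rather than the exact benchmark code. The softmax baseline in (b) simply uses the maximum softmax probability of the same model's logits.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def fit_class_gaussians(feats, labels, reg=1e-6):
    """Fit one full-covariance Gaussian per class to (L2-normalized) features."""
    classes = np.unique(labels)
    comps, log_priors = [], []
    for c in classes:
        fc = feats[labels == c]
        cov = np.cov(fc, rowvar=False) + reg * np.eye(fc.shape[1])  # regularized covariance
        comps.append(multivariate_normal(mean=fc.mean(axis=0), cov=cov))
        log_priors.append(np.log(len(fc) / len(feats)))
    return comps, np.array(log_priors)

def gmm_log_density(comps, log_priors, feats):
    """Mixture log-density of features; low values indicate likely OoD inputs."""
    log_probs = np.stack([mvn.logpdf(feats) for mvn in comps], axis=1)
    return logsumexp(log_probs + log_priors, axis=1)
```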
A.3 Fitting GMMs on Logit Space
Table 4 shows the results of experiments with GMMs fit over logit space. This approach performs worse than GMMs fit over feature space in all cases. Intuitively, this makes sense: even under perfect NC, we would expect OoD inputs to increase the variability of class clusters along arbitrary dimensions of feature space. A Singular Value Decomposition (SVD) over feature space supports this intuition. In Figure 6, we show the SVD of all training embeddings for CIFAR10, along with the singular values for the test set and the SVHN OoD test set projected onto the same basis used for the training singular values. As expected, the first 10 singular values contain nearly all of the information. However, the remaining 502 singular values contain significantly more information in the OoD case. This information is critical for identifying OoD examples in feature space and, due to dimensionality reduction, is severely reduced in logit space.
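A minimal sketch of this check is shown below, assuming train_feats, test_feats, and ood_feats are NumPy arrays of penultimate-layer embeddings (one row per example); the array names are placeholders. When applied to the training set itself, the per-direction energy below reduces to the training singular values, so the same quantity can be compared across in-distribution and OoD data.

```python
import numpy as np

# Basis from the (centered) training embeddings, e.g. shape (N, 512) for ResNet18.
train_mean = train_feats.mean(axis=0)
_, train_sv, Vt = np.linalg.svd(train_feats - train_mean, full_matrices=False)

def spectrum_in_train_basis(feats):
    """Per-direction energy of embeddings projected onto the training basis."""
    coords = (feats - train_mean) @ Vt.T   # coordinates in the training basis
    return np.linalg.norm(coords, axis=0)  # one value per training direction

id_spectrum = spectrum_in_train_basis(test_feats)   # e.g. CIFAR-10 test embeddings
ood_spectrum = spectrum_in_train_basis(ood_feats)   # e.g. SVHN embeddings
# Under NC the first ~10 directions carry nearly all in-distribution energy,
# while OoD embeddings place noticeably more energy in the remaining directions.
```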
A.4 Overtraining with L2 Normalization
Table 6 shows the results of overtraining with L2 normalization (L2 350). While there is no substantial penalty for overtraining by 10 to 100 epochs (Figure 5, Right), training for the full 350 epochs (as in the DDU baseline) starts to reduce OoD performance by a few percentage points. We note that there is a tradeoff with classification accuracy, which does continue to increase when overtraining to 350 epochs.
Table 6: OoD detection (a) and classification accuracy results (b) for ResNet18 and ResNet50 models, 15 seeds per experiment, trained on CIFAR10, with SVHN, CIFAR100 and Tiny ImageNet test sets used as OoD data. For all models, we indicate whether L2 normalization over feature space was used (L2/No L2) and how many training epochs occurred (60/100/350), and compare against baseline (No L2 350).
A.5 Neural Collapse Measurements for NC Loss Intervention
A.6 Additional Figures