Authors:
(1) Hyosun Park, Department of Astronomy, Yonsei University, Seoul, Republic of Korea;
(2) Yongsik Jo, Artificial Intelligence Graduate School, UNIST, Ulsan, Republic of Korea;
(3) Seokun Kang, Artificial Intelligence Graduate School, UNIST, Ulsan, Republic of Korea;
(4) Taehwan Kim, Artificial Intelligence Graduate School, UNIST, Ulsan, Republic of Korea;
(5) M. James Jee, Department of Astronomy, Yonsei University, Seoul, Republic of Korea and Department of Physics and Astronomy, University of California, Davis, CA, USA.
Table of Links
Abstract and 1 Introduction
2 Method
2.1. Overview and 2.2. Encoder-Decoder Architecture
2.3. Transformers for Image Restoration
2.4. Implementation Details
3 Data and 3.1. HST Dataset
3.2. GalSim Dataset
3.3. JWST Dataset
4 JWST Test Dataset Results and 4.1. PSNR and SSIM
4.2. Visual Inspection
4.3. Restoration of Morphological Parameters
4.4. Restoration of Photometric Parameters
5 Application to real HST Images and 5.1. Restoration of Single-epoch Images and Comparison with Multi-epoch Images
5.2. Restoration of Multi-epoch HST Images and Comparison with Multi-epoch JWST Images
6 Limitations
6.1. Degradation in Restoration Quality Due to High Noise Level
6.2. Point Source Recovery Test
6.3. Artifacts Due to Pixel Correlation
7 Conclusions and Acknowledgements
Appendix: A. Image restoration test with Blank Noise-Only Images
References
2. METHOD
2.1. Overview
Throughout the paper, we use the term restoration to refer to the process that simultaneously improves resolution and reduces noise. Our goal is to restore HST-quality images to JWST quality based on the Restormer implementation (Zamir et al. 2022) of the Transformer architecture (Vaswani et al. 2017). We first briefly review the encoder-decoder architecture in §2.2. §2.3 describes the Transformer architecture, including the implementation of Zamir et al. (2022). Implementation details are presented in §2.4.
2.2. Encoder-Decoder Architecture
The encoder-decoder architecture allows neural networks to learn to map input data to output data in a structured and hierarchical manner. The encoder extracts characteristic features from the input data and encodes them into a compressed representation, while the decoder reconstructs or generates the desired output from this encoded representation. This architecture has been widely used in various applications, including image-to-image translation, image segmentation, and language translation.
U-Net (Ronneberger et al. 2015) is a classic example of a CNN-based encoder-decoder architecture. It consists of a contracting path, which serves as the encoder, an expanding path, which functions as the decoder, and skip connections that link corresponding layers between the two paths. In the convolutional layers of the contracting path, the spatial dimensions are progressively reduced while the number of channels is increased to capture the essential image features. The expanding path, starting only from this low-dimensional encoded information, decreases the number of channels and increases the spatial dimensions to restore the high-dimensional image. To mitigate the information loss incurred in the contracting path, skip connections concatenate the features obtained at each layer of the encoding stage with those of the corresponding layer in the decoding stage.
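For illustration, a minimal PyTorch sketch of a U-Net-style encoder-decoder with a single skip connection might look like the following; the two-level depth and layer widths are illustrative choices only, not the configuration used in this work.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Minimal two-level U-Net-style encoder-decoder (illustrative only)."""

    def __init__(self, in_ch=1, base_ch=32):
        super().__init__()
        # Contracting path: spatial size halves while channel count doubles.
        self.enc1 = nn.Sequential(nn.Conv2d(in_ch, base_ch, 3, padding=1), nn.ReLU())
        self.down = nn.Conv2d(base_ch, base_ch * 2, 3, stride=2, padding=1)
        self.enc2 = nn.Sequential(nn.Conv2d(base_ch * 2, base_ch * 2, 3, padding=1), nn.ReLU())
        # Expanding path: spatial size doubles while channel count halves.
        self.up = nn.ConvTranspose2d(base_ch * 2, base_ch, 2, stride=2)
        # After concatenating the skip connection, the channel count is base_ch * 2.
        self.dec1 = nn.Sequential(nn.Conv2d(base_ch * 2, base_ch, 3, padding=1), nn.ReLU())
        self.out = nn.Conv2d(base_ch, in_ch, 1)

    def forward(self, x):
        f1 = self.enc1(x)              # encoder features at full resolution
        f2 = self.enc2(self.down(f1))  # compressed representation
        u = self.up(f2)                # decoder upsamples back to full resolution
        u = torch.cat([u, f1], dim=1)  # skip connection mitigates information loss
        return self.out(self.dec1(u))

# Example: map a batch of 64x64 single-channel images to same-sized outputs.
y = TinyUNet()(torch.randn(2, 1, 64, 64))
print(y.shape)  # torch.Size([2, 1, 64, 64])
```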
2.3. Transformers for Image Restoration
In the Transformer, the encoder consists of multiple layers of self-attention mechanisms followed by position-wise feed-forward neural networks. “Attention” refers to a mechanism that allows models to focus on specific parts of the input data while processing it. It enables the model to selectively weigh different parts of the input, giving more importance to relevant information and ignoring irrelevant or less important parts. The key idea behind attention is to dynamically compute weights for different parts of the input data, such as words in a sentence or pixels in an image, based on their relevance to the current task. In self-attention, each element (e.g., word or pixel) in the input sequence is compared to every other element to compute attention weights, which represent the importance of each element with respect to the others. These attention weights are then used to compute a weighted sum of the input elements, resulting in an attention-based representation that highlights relevant information.
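As a concrete example, a minimal scaled dot-product self-attention over a sequence of feature vectors can be written as follows; this is the generic formulation of Vaswani et al. (2017), not the channel-wise variant used by Restormer, and the projection matrices are illustrative placeholders.

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a sequence of feature vectors.

    x: (n, d) -- n elements (e.g., pixels or words), each a d-dimensional feature.
    w_q, w_k, w_v: (d, d) projection matrices for queries, keys, and values.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Every element is compared with every other element: (n, n) attention weights.
    weights = F.softmax(q @ k.T / k.shape[-1] ** 0.5, dim=-1)
    # The weighted sum of the values highlights the relevant information.
    return weights @ v

n, d = 16, 8
x = torch.randn(n, d)
out = self_attention(x, *(torch.randn(d, d) for _ in range(3)))
print(out.shape)  # torch.Size([16, 8])
```

Note that the (n, n) weight matrix is what makes the cost of pixel-wise self-attention grow quadratically with the number of pixels, the issue discussed in §2.3 below.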
The Transformer decoder also consists of multiple layers of self-attention mechanisms, along with additional attention mechanisms over the encoder’s output. The decoder predicts one element of the output sequence at a time, conditioned on the previously generated elements and the encoded representation of the input sequence.
The Transformer architecture was initially proposed and applied to the task of machine translation, which involves translating text from one language to another. The success of the Transformer in machine translation tasks demonstrated its effectiveness in capturing long-range dependencies in sequences and handling sequential data more efficiently than traditional architectures. This breakthrough sparked widespread interest in the Transformer architecture, leading to its adoption and adaptation for various image processing tasks. Transformers have shown promising results in tasks such as image classification, object detection, semantic segmentation, and image generation, which were traditionally dominated by CNNs. Transformer models capture long-range pixel correlations more effectively than CNN-based models.
However, using the Transformer model on large images becomes challenging with its original implementation, which applies self-attention layers on pixels. This is because the computational complexity escalates quadratically with the pixel count. Zamir et al. (2022) overcame this obstacle by substituting the original self-attention block with the multi-Dconv head transposed attention (MDTA) block, which implements self-attention in the feature domain so that the complexity increases only linearly with the number of pixels. We propose to use Zamir et al. (2022)'s efficient Transformer, Restormer, to apply deconvolution and denoising to astronomical images. We briefly describe the two core components of Restormer in §2.3.1 and §2.3.2. Readers are referred to Zamir et al. (2022) for more technical details.
2.3.1. MDTA block
MDTA is a crucial module within Restormer. By performing self-attention across the channel dimension rather than the spatial dimension, MDTA computes query-key interactions between the channels of the input feature map. Through this process, MDTA effectively models cross-channel interactions, facilitating the learning of the global context necessary for image restoration tasks.
MDTA also employs depth-wise convolutions to accentuate local context, so that the module captures the local structure of the input image as well, ultimately modeling both global and local contexts.
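The sketch below illustrates the idea of transposed (channel-wise) attention combined with depth-wise convolutions; it is a simplified, single-head approximation of the MDTA design described by Zamir et al. (2022), not their exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    """Simplified single-head transposed attention (MDTA-like, illustrative)."""

    def __init__(self, ch):
        super().__init__()
        # Point-wise then depth-wise convolutions produce Q, K, V and mix local context.
        self.qkv = nn.Conv2d(ch, ch * 3, kernel_size=1)
        self.dwconv = nn.Conv2d(ch * 3, ch * 3, kernel_size=3, padding=1, groups=ch * 3)
        self.project = nn.Conv2d(ch, ch, kernel_size=1)

    def forward(self, x):
        b, c, h, w = x.shape
        q, k, v = self.dwconv(self.qkv(x)).chunk(3, dim=1)
        # Flatten the spatial dimensions: attention is computed across channels (c x c),
        # so the cost grows linearly, not quadratically, with the number of pixels.
        q = F.normalize(q.reshape(b, c, h * w), dim=-1)
        k = F.normalize(k.reshape(b, c, h * w), dim=-1)
        v = v.reshape(b, c, h * w)
        attn = (q @ k.transpose(-2, -1)).softmax(dim=-1)  # (b, c, c) channel interactions
        out = (attn @ v).reshape(b, c, h, w)
        return self.project(out)

print(ChannelAttention(16)(torch.randn(2, 16, 32, 32)).shape)  # torch.Size([2, 16, 32, 32])
```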
2.3.2. GDFN block
GDFN, short for Gated-Dconv Feed-Forward Network, is another crucial module within Restormer. By enhancing the feed-forward network with a gating mechanism, GDFN improves the information flow, resulting in high-quality outcomes for image restoration tasks.
GDFN controls the information flow through gating layers, composed of the element-wise multiplication of two linear projection layers, one of which is activated by the Gaussian Error Linear Unit (GELU) non-linearity. This allows GDFN to suppress less informative features and hierarchically transmit only valuable information. Similar to the MDTA module, GDFN employs depth-wise convolutions for local content mixing. Through this, GDFN emphasizes the local context of the input image, providing a more robust information flow for enhanced results in image restoration tasks.
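A simplified version of such a gated feed-forward block is sketched below; the expansion factor and the placement of the depth-wise convolution are illustrative choices following the description above, and the exact hyperparameters of Restormer's GDFN differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedFeedForward(nn.Module):
    """Simplified gated-Dconv feed-forward block (GDFN-like, illustrative)."""

    def __init__(self, ch, expansion=2):
        super().__init__()
        hidden = ch * expansion
        # Two parallel linear projections (as 1x1 convs), followed by a
        # depth-wise convolution for local content mixing.
        self.proj_in = nn.Conv2d(ch, hidden * 2, kernel_size=1)
        self.dwconv = nn.Conv2d(hidden * 2, hidden * 2, kernel_size=3,
                                padding=1, groups=hidden * 2)
        self.proj_out = nn.Conv2d(hidden, ch, kernel_size=1)

    def forward(self, x):
        x1, x2 = self.dwconv(self.proj_in(x)).chunk(2, dim=1)
        # Gating: the GELU-activated branch modulates the other branch element-wise,
        # suppressing less informative features.
        return self.proj_out(F.gelu(x1) * x2)

print(GatedFeedForward(16)(torch.randn(2, 16, 32, 32)).shape)  # torch.Size([2, 16, 32, 32])
```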
2.4. Implementation Details
We use a transfer learning approach, in which a model trained on one dataset is reused for another related dataset. First, we train on the pre-training dataset (simplified galaxy images) for 150,000 iterations, followed by an additional 150,000 iterations on the fine-tuning dataset (realistic galaxy images). The batch size is fixed at 64. Additionally, to compare and analyze the performance of training on individual datasets, we also train models using only the pre-training or only the fine-tuning dataset. Our inference model is publicly available [1].
[1] https://github.com/JOYONGSIK/GalaxyRestoration
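The two-stage schedule can be summarized by the sketch below; only the iteration counts and the two-stage pre-training/fine-tuning procedure come from the text, while the optimizer, learning rate, loss function, and data loaders are hypothetical placeholders.

```python
import torch

def train(model, loader, n_iters, lr=3e-4, device="cuda"):
    """Run n_iters optimization steps on one dataset (hypothetical helper)."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = torch.nn.L1Loss()  # placeholder restoration loss
    it = 0
    while it < n_iters:
        for lowq, highq in loader:  # low-quality input, high-quality target pairs
            pred = model(lowq.to(device))
            loss = loss_fn(pred, highq.to(device))
            opt.zero_grad()
            loss.backward()
            opt.step()
            it += 1
            if it >= n_iters:
                break
    return model

# Stage 1: pre-train on simplified galaxy images (loader assumed to yield batches of 64).
# model = train(model, pretrain_loader, n_iters=150_000)
# Stage 2: fine-tune on realistic galaxy images for another 150,000 iterations.
# model = train(model, finetune_loader, n_iters=150_000)
```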