Authors:
(1) Vijay Ekambaram, IBM Research;
(2) Arindam Jati, IBM Research;
(3) Nam H. Nguyen, IBM Research;
(4) Pankaj Dayama, IBM Research;
(5) Chandra Reddy, IBM Research;
(6) Wesley M. Gifford, IBM Research;
(7) Jayant Kalagnanam, IBM Research.
Editor’s note: this is part 3 of 5 of a study detailing the development of a tiny, fast AI model that delivers excellent accuracy. Read the rest below.
3 TTM Workflows
TTM operates in two stages: pre-training and fine-tuning (Figure 1(a)).
3.1 Pre-training Workflow
Multi-Resolution Pre-training via TTM Backbone
Most of the pre-training happens in the TTM backbone. The primary challenge with the proposed pre-training technique is that the pre-training data is diverse and spans multiple resolutions. There are two main options: pre-training a separate model for each resolution type, or pre-training a single model on data from all resolutions collectively. While it is common to train one model per resolution type to sidestep the difficulty of learning diverse seasonal patterns, doing so shrinks the training data available for each resolution, since public data is limited. This motivated us to pre-train a single model using datasets from all resolutions. To achieve this, we propose the following three enhancements.
Data Augmentation via Downsampling: A significant challenge in TS pre-training is the scarcity of public datasets at specific resolutions. To overcome this, we employ a downsampling technique on high-resolution datasets, generating multiple datasets at lower resolutions. For example, from a one-second resolution dataset, we derive datasets at minute and hour resolutions. Note that the original high-resolution dataset remains in the pool of pre-training datasets. This methodology significantly augments the number of datasets available at each resolution, which greatly improves model performance (Section 4.5).
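To make the augmentation concrete, here is a minimal sketch of how a high-resolution series can be mean-pooled down to coarser resolutions while keeping the original in the pre-training pool. The function name `downsample` and the pooling scheme are illustrative assumptions, not the authors' actual pre-processing code.

```python
# Hypothetical sketch of downsampling-based data augmentation (assumed mean pooling).
import numpy as np

def downsample(series: np.ndarray, factor: int) -> np.ndarray:
    """Mean-pool a 1-D series by `factor` to emulate a coarser resolution."""
    usable = (len(series) // factor) * factor      # drop the ragged tail
    return series[:usable].reshape(-1, factor).mean(axis=1)

# e.g., derive minute- and hour-level datasets from a one-second series
second_series = np.random.randn(3 * 3600)          # 3 hours of 1-second data
minute_series = downsample(second_series, 60)      # 1-minute resolution
hour_series   = downsample(second_series, 3600)    # 1-hour resolution

# The original high-resolution series stays in the pre-training pool
pretrain_pool = {"1s": second_series, "1min": minute_series, "1h": hour_series}
```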
Resolution Prefix Tuning: This technique explicitly learns a new patch embedding, conditioned on the input resolution type, and prepends it to the input data as a prefix (see Figure 1(b)). Similar to the concept of prefix tuning [Li and Liang, 2021], this gives the model an explicit signal about the resolution type, enabling resolution-conditioned modeling. First, we map every resolution to a unique integer, which is passed through an embedding layer to project it to the hidden dimension hf. We then expand the embedding across all channels to obtain a representation of shape c×1×hf. This module is optional in the TTM backbone and is particularly beneficial when the context length (sl) is short, since automatically detecting the resolution from a short history is challenging. Explicitly fusing the resolution information as a prefix therefore enhances the model's ability to learn effectively across resolutions.
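The sketch below illustrates one way such a resolution prefix could be implemented, assuming a PyTorch backbone whose patch embeddings have shape (batch, channels, num_patches, hf). The module name `ResolutionPrefix` and the resolution-to-integer mapping are assumptions for illustration only.

```python
# Illustrative resolution prefix: one learnable embedding per resolution type,
# expanded across channels and prepended to the patch sequence.
import torch
import torch.nn as nn

class ResolutionPrefix(nn.Module):
    def __init__(self, num_resolutions: int, hf: int):
        super().__init__()
        # e.g., 0 = second, 1 = minute, 2 = hour (assumed mapping)
        self.embed = nn.Embedding(num_resolutions, hf)

    def forward(self, patches: torch.Tensor, resolution_id: torch.Tensor) -> torch.Tensor:
        # patches: (b, c, n, hf); resolution_id: (b,)
        b, c, _, hf = patches.shape
        prefix = self.embed(resolution_id)                      # (b, hf)
        prefix = prefix.view(b, 1, 1, hf).expand(b, c, 1, hf)   # (b, c, 1, hf)
        # prepend the resolution embedding as an extra "patch"
        return torch.cat([prefix, patches], dim=2)              # (b, c, n + 1, hf)

# usage: resolution id 1 might denote minute-level data
prefix_module = ResolutionPrefix(num_resolutions=8, hf=64)
patches = torch.randn(4, 3, 16, 64)                             # b=4, c=3, n=16, hf=64
out = prefix_module(patches, torch.tensor([1, 1, 1, 1]))
print(out.shape)                                                # torch.Size([4, 3, 17, 64])
```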
3.2 Fine-tuning Workflow
In the fine-tuning workflow, we deal with data from the target domain that has no overlap with the pre-training datasets. We have three options: (a) in Zero-shot forecasting, we directly evaluate the pre-trained model on the test part of the target data; (b) in Few-shot forecasting, we use only a tiny portion (5-10%) of the train part of the target data to quickly update the pre-trained weights of the decoder and head, and then evaluate on the test part; (c) in Full-shot forecasting, we fine-tune the pre-trained weights of the decoder and head on the entire train part of the target data, and then evaluate on the test part.
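A hedged sketch of these three options is given below. It assumes a model object exposing `backbone`, `decoder`, and `head` submodules and takes the last few-shot fraction of the train split; these names and the sampling strategy are assumptions for illustration.

```python
# Sketch of zero-/few-/full-shot setup: freeze the backbone, train decoder + head.
import torch

def select_fewshot_subset(train_set, fraction: float = 0.05):
    """Keep a small slice (5-10%) of the train split for few-shot fine-tuning."""
    n = max(1, int(len(train_set) * fraction))
    return torch.utils.data.Subset(train_set, range(len(train_set) - n, len(train_set)))

def configure_for_finetuning(model, mode: str):
    if mode == "zero_shot":
        model.eval()          # no weight updates; evaluate the pre-trained model directly
        return []
    # few-shot / full-shot: freeze the backbone, update only decoder and head
    for p in model.backbone.parameters():
        p.requires_grad = False
    return list(model.decoder.parameters()) + list(model.head.parameters())
```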
The backbone is completely frozen during fine-tuning and still operates in a channel-independent, univariate fashion. The TTM decoder, however, can be fine-tuned in a channel-mixing (multivariate) or channel-independent (univariate) way, depending on the nature of the target data. If pure multivariate modeling is needed, the channel-mixer block in all the TSMixer components of the decoder (see Figure 1(b)) is enabled to explicitly capture correlations across channels. The forecast head and reverse normalization perform the same operations as in the pre-training stage. Fine-tuning optimizes the same forecasting objective with an MSE loss. This multi-level design ensures that the backbone excels at channel-independent pre-training, enabling effective temporal-correlation modeling across diverse datasets, while the decoder handles target-data-specific tasks such as channel-correlation modeling and fine-tuning. In addition, if the target data has exogenous variables, an exogenous mixer block is applied to the actual forecasts, as explained next.
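For clarity, the sketch below shows one fine-tuning step under this design: the frozen channel-independent backbone, an optional channel-mixer toggle in the decoder, and the MSE objective. The flag `enable_channel_mixing` and the submodule names are hypothetical stand-ins, not the authors' actual API.

```python
# Illustrative decoder-only fine-tuning step with an optional channel mixer.
import torch
import torch.nn as nn

def finetune_step(model, batch_x, batch_y, optimizer, multivariate: bool):
    # enable cross-channel mixing only when the target data is multivariate
    if hasattr(model.decoder, "enable_channel_mixing"):
        model.decoder.enable_channel_mixing = multivariate

    with torch.no_grad():                          # frozen, channel-independent backbone
        features = model.backbone(batch_x)
    forecast = model.head(model.decoder(features)) # trainable decoder + forecast head

    loss = nn.functional.mse_loss(forecast, batch_y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```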