Table of Links
- Abstract and 1. Introduction
- Related Work
- Methodology
- Experimental Setup and Results
- Conclusion and Future Work
- Acknowledgments
- Reproducibility Statement
- Impact Statement, and References
Transformers and patching for time series modeling. There is a growing body of work utilizing transformers for various time series analysis tasks (Wen et al., 2023). One issue with applying transformers to time series data is the complexity of the self-attention mechanism, which grows quadratically with the number of input tokens (i.e., the length of the time series) (Li et al., 2019). Nie et al. (2023) demonstrated that treating time series sub-sequences (or patches) as tokens, instead of individual time points, is a simple, efficient, and effective mechanism for learning useful representations for forecasting. Drawing inspiration from this prior work, we build on the transformer architecture, which takes disjoint time series sub-sequences (or patches) as input.
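To make the patching idea concrete, the following PyTorch sketch splits a univariate series into disjoint patches and linearly embeds each patch as a token. The patch length, embedding size, and module names are illustrative assumptions, not the exact architecture used in this work.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split a time series into disjoint patches and embed each patch as a token.
    Hyperparameters (patch_len=8, d_model=64) are illustrative, not the paper's."""
    def __init__(self, patch_len: int = 8, d_model: int = 64):
        super().__init__()
        self.patch_len = patch_len
        self.proj = nn.Linear(patch_len, d_model)  # one linear projection per patch

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len); seq_len assumed to be a multiple of patch_len
        batch, seq_len = x.shape
        patches = x.reshape(batch, seq_len // self.patch_len, self.patch_len)
        return self.proj(patches)  # (batch, num_patches, d_model)

# A length-512 series becomes 64 patch tokens instead of 512 point tokens,
# shrinking the quadratic self-attention cost accordingly.
x = torch.randn(2, 512)
tokens = PatchEmbedding()(x)
print(tokens.shape)  # torch.Size([2, 64, 64])
```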
Masked Representation Learning. Masked pre-training is a widely used self-supervised learning task in which a model learns to accurately reconstruct masked portions of its input. Masked language modeling (Devlin et al., 2019; Raffel et al., 2020) and masked image modeling (Xie et al., 2022; Li et al., 2023b) have been used successfully to learn models from vast quantities of unlabeled data that generalize to a variety of downstream tasks.
For time series data, prior work has primarily focused on contrastive representation learning (Yue et al., 2022; Eldele et al., 2021; Franceschi et al., 2019). However, contrastive learning relies on data augmentation, which is both subjective and data-dependent. In contrast, some studies mask portions of time series using zeros and learn a model to reconstruct them (Nie et al., 2023; Zerveas et al., 2021; Dong et al., 2023; Li et al., 2023c).
Representation learning via masking is well-suited to all the downstream tasks we consider, especially forecasting and imputation, as both are instances of the masked reconstruction problem. Due to its simplicity and its success in the vision and language domains, we use the masked prediction task to pre-train our model, using a special embedding (see [MASK] in Fig. 3), rather than zeros, to mask time series patches.
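A minimal sketch of such a masked-patch pre-training objective is given below, building on the patch embedding above. The 30% mask ratio, the learnable [MASK] embedding, the two-layer encoder, and the reconstruction head are assumptions for illustration and do not reproduce the exact model or training setup.

```python
import torch
import torch.nn as nn

class MaskedPatchPretrainer(nn.Module):
    """Replace a random subset of patch embeddings with a learnable [MASK] token
    and train the encoder to reconstruct the original patch values.
    All sizes and the 30% mask ratio are illustrative assumptions."""
    def __init__(self, patch_len: int = 8, d_model: int = 64, mask_ratio: float = 0.3):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.embed = nn.Linear(patch_len, d_model)
        self.mask_token = nn.Parameter(torch.zeros(d_model))  # learnable [MASK] embedding
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, patch_len)  # reconstruct raw patch values

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (batch, num_patches, patch_len)
        tokens = self.embed(patches)
        mask = torch.rand(patches.shape[:2], device=patches.device) < self.mask_ratio
        # mask with a learned embedding instead of zeroing the raw input
        tokens = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(tokens), tokens)
        recon = self.head(self.encoder(tokens))
        # reconstruction loss is computed only on the masked patches
        return nn.functional.mse_loss(recon[mask], patches[mask])

patches = torch.randn(2, 64, 8)
loss = MaskedPatchPretrainer()(patches)
loss.backward()
```

Forecasting and imputation then reduce to the same mechanism: mask the patches covering the horizon (or the missing values) and reconstruct them.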
Cross-modal transfer learning using language models. Lu et al. (2022) first showed that transformers pre-trained on text data (LLMs) can effectively solve sequence modeling tasks in other modalities. Subsequently, Shen et al. (2023) introduced ORCA, a general cross-modal fine-tuning framework that extends a single large-scale pre-trained model to diverse modalities by adapting it to a target task via an align-then-refine workflow: given the target input, ORCA first learns an embedding network that aligns the embedded feature distribution with the pre-training modality, and then fine-tunes the pre-trained model on the embedded data, exploiting knowledge shared across modalities. Some recent studies have leveraged this inherent ability of language-pre-trained transformers to “reprogram” LLMs for time series analysis using parameter-efficient fine-tuning and suitable tokenization strategies (Zhou et al., 2023; Gruver et al., 2023; Jin et al., 2023; Cao et al., 2023; Ekambaram et al., 2024). However, some of these models (Jin et al., 2023; Gruver et al., 2023) have billions of parameters and demand significant memory and computational resources to perform well. We complement this line of research with three empirical observations (Sec. 4.3). We show that (1) transformers trained on time series can also model sequences from other modalities, (2) during pre-training, randomly initializing weights leads to a lower pre-training loss than initializing with language modeling weights, and (3) models pre-trained on time series outperform LLM-based models such as those of Zhou et al. (2023) and Jin et al. (2023) on many tasks and datasets.
Unanswered Questions. To the best of our knowledge, two questions remain largely unanswered in prior work on time series modeling. First, all existing time series models are (pre-)trained and fine-tuned on individual datasets (Nie et al., 2023; Yue et al., 2022; Wu et al., 2023; Zhou et al., 2023), and the benefits (or drawbacks) of large-scale multi-dataset pre-training remain unexplored (Wen et al., 2023). Second, there is very limited work on time series modeling in limited-supervision settings, such as zero-shot forecasting (Oreshkin et al., 2021) or few-shot classification (Narwariya et al., 2020). In our work, we consider both of these questions and show that pre-training a model of sufficient capacity on a large corpus of unlabeled time series data can in fact enable it to provide reasonably accurate predictions in limited-supervision settings.
Authors:
(1) Mononito Goswami, Auton Lab, Robotics Institute, Carnegie Mellon University, Pittsburgh, USA ([email protected]);
(2) Konrad Szafer, Auton Lab, Robotics Institute, Carnegie Mellon University, Pittsburgh, USA, with equal contribution, order decided using a random generator;
(3) Arjun Choudhry, Auton Lab, Robotics Institute, Carnegie Mellon University, Pittsburgh, USA, with equal contribution, order decided using a random generator;
(4) Yifu Cai, Auton Lab, Robotics Institute, Carnegie Mellon University, Pittsburgh, USA;
(5) Shuo Li, University of Pennsylvania, Philadelphia, USA;
(6) Artur Dubrawski, Auton Lab, Robotics Institute, Carnegie Mellon University, Pittsburgh, USA.