Table of Links
- Abstract and 1. Introduction
- Related Work
- Methodology
- Experimental Setup and Results
- Conclusion and Future Work
- Acknowledgments
- Reproducibility statement
- Impact statement, and References
3. Methodology
We first collect a large number of public time series into the Time Series Pile, and then use it to pre-train a transformer model on the masked time series prediction task. We discuss each of these steps in the following sections.
3.1. The Time Series Pile
Unlike natural language processing and computer vision, where large-scale datasets such as The Pile (Gao et al., 2020) and ImageNet-1K (Russakovsky et al., 2015) are readily available for pre-training, public time series datasets are much smaller, scattered, and largely task-specific (Ma et al., 2023; Zhou et al., 2023; Gruver et al., 2023). To bridge this gap, we collate multiple time series from 4 task-specific, widely-used public repositories, resulting in a large number of time series spanning diverse domains and time series characteristics such as lengths, amplitudes, and temporal resolutions. We call this collection the Time Series Pile.
Informer long-horizon forecasting datasets (Zhou et al., 2021) is a collection of 9 datasets that are widely used to evaluate long-horizon forecasting performance (Wu et al., 2023; Nie et al., 2023; Challu et al., 2023): 2 hourly and 2 minute-level subsets of the Electricity Transformer Temperature (ETT) dataset (Zhou et al., 2021), Electricity (Trindade, 2015), Traffic (California Department of Transportation, 2024), Weather (Max Planck Institute for Biogeochemistry, 2024), Influenza-like Illness (ILI) (Centers for Disease Control and Prevention, 2024), and Exchange-rate (Lai et al., 2018).
Monash time series forecasting archive (Godahewa et al., 2021) is a collection of 58 publicly available short-horizon forecasting datasets with a total of over 100K time series, spanning a variety of domains and temporal resolutions.
UCR/UEA classification archive (Dau et al., 2018) comprises 159 time series datasets which are frequently used to benchmark classification algorithms (Ismail Fawaz et al., 2019). These datasets, belonging to seven different categories (Image Outline, Sensor Readings, Motion Capture, Spectrographs, ECG, Electric Devices, and Simulated Data), vary substantially in terms of the number of classes and the size of the training set.
TSB-UAD anomaly benchmark (Paparrizos et al., 2022b) is a recent collection of 1980 univariate time series with labeled anomalies from 18 anomaly detection datasets proposed over the past decade. This collection includes both synthetic and real-world time series originating from a wide range of sources such as the human body, spaceships, the environment, and web servers.
Minimizing data contamination using careful train-test splitting. We carefully split each dataset into disjoint training, validation, and test splits, based on splits specified by the data creators. When such splits are not available, we randomly sample 60% of the data for training, 10% for validation, and 30% for testing. Long-horizon forecasting and anomaly detection datasets typically consist of long time series, which are split horizontally as shown in Fig. 2. Conversely, short-horizon forecasting and classification datasets often contain multiple short time series. For these datasets, each complete time series is assigned entirely to either the training, validation, or test split. We use the same random seed, set to 13, throughout our experiments, from pre-training to downstream evaluation, thus ensuring that MOMENT only observes the training splits of datasets during pre-training.
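As an illustration, the splitting logic described above can be sketched as follows. This is a minimal sketch in NumPy; the function names are ours, not from the released MOMENT code, while the 60/10/30 ratios and seed 13 follow the text.

```python
import numpy as np

SEED = 13  # the same seed is used from pre-training through downstream evaluation

def split_long_series(series: np.ndarray, train_frac=0.6, val_frac=0.1):
    """Horizontal split of one long time series (long-horizon forecasting, anomaly detection)."""
    n = len(series)
    train_end, val_end = int(train_frac * n), int((train_frac + val_frac) * n)
    return series[:train_end], series[train_end:val_end], series[val_end:]

def split_series_collection(collection, train_frac=0.6, val_frac=0.1, seed=SEED):
    """Whole-series split of a dataset of many short time series (classification,
    short-horizon forecasting): each series belongs to exactly one split."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(collection))
    train_end = int(train_frac * len(idx))
    val_end = int((train_frac + val_frac) * len(idx))
    pick = lambda ids: [collection[i] for i in ids]
    return pick(idx[:train_end]), pick(idx[train_end:val_end]), pick(idx[val_end:])
```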
3.2. Model Architecture
Handling varying time series characteristics. Time series vary in length, number of channels, amplitudes, and temporal resolutions. We address variable length by restricting MOMENT’s input to a univariate time series of a fixed length T = 512. As is common practice, we sub-sample longer time series and pad shorter ones with zeros on the left[2]. Moreover, segmenting time series into patches quadratically reduces MOMENT’s memory footprint and computational complexity, and linearly increases the length of time series it can take as input. We handle multivariate time series by independently operating on each channel along the batch dimension. Like recent studies (Zhou et al., 2023; Nie et al., 2023), we found that modeling each channel independently is an effective strategy for multivariate time series. Finally, re-scaling and centering time series using reversible instance normalization enables MOMENT to model time series with significantly different temporal distributions (Kim et al., 2022). We did not explicitly model the temporal resolution of time series, since this information is often unavailable outside of time series forecasting datasets.
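The input pipeline implied by this paragraph can be sketched as follows. This is an illustrative simplification: the instance normalization shown is a plain per-series z-score (the actual RevIN of Kim et al. (2022) additionally learns affine parameters), and truncating to the last T observations stands in for the sub-sampling strategy; none of the names below come from the MOMENT codebase.

```python
import torch
import torch.nn.functional as F

T, P = 512, 8  # fixed input length and patch length; N = T // P = 64 patches

def prepare_input(x: torch.Tensor, eps: float = 1e-5):
    """x: (batch, channels, length). Returns instance-normalized patches of shape
    (batch * channels, N, P) plus the statistics needed to undo the normalization."""
    b, c, l = x.shape
    x = x.reshape(b * c, l)               # channel independence: treat each channel as a sample
    if l >= T:
        x = x[:, -T:]                     # stand-in for sub-sampling longer series
    else:
        x = F.pad(x, (T - l, 0))          # left-pad shorter series with zeros
    mean = x.mean(dim=-1, keepdim=True)   # simplified reversible instance normalization
    std = x.std(dim=-1, keepdim=True) + eps
    x = (x - mean) / std
    patches = x.unfold(dimension=-1, size=P, step=P)   # (batch * channels, 64, 8)
    return patches, mean, std
```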
Intentionally simple encoder. Closely following the design of transformers in the language domain allows us to leverage their scalable and efficient implementations (e.g., gradient checkpointing, mixed precision training).
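For concreteness, a T5-style encoder with the efficiency features mentioned above could be instantiated roughly as follows using HuggingFace Transformers; the configuration values mirror the Base model of Sec. 3.3, but the snippet itself is illustrative rather than MOMENT's released code.

```python
import torch
from transformers import T5Config, T5EncoderModel

# Randomly initialized T5-style encoder stack (Base-sized values, see Sec. 3.3).
config = T5Config(num_layers=12, d_model=768, num_heads=12, d_ff=3072)
encoder = T5EncoderModel(config)
encoder.gradient_checkpointing_enable()          # trade compute for memory during pre-training

patch_embeddings = torch.randn(4, 64, 768)       # (batch, N patches, D) from a patch embedding layer
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):  # mixed precision
    hidden = encoder(inputs_embeds=patch_embeddings).last_hidden_state   # (4, 64, 768)
```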
Light-weight prediction head. We use a lightweight prediction head instead of a decoder of the same size as the encoder. This enables the architectural modifications necessary for task-specific fine-tuning with a limited number of trainable parameters, while keeping the bulk of the parameters and the high-level features learned by the encoder intact.
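A minimal sketch of what such a lightweight head could look like for reconstruction, namely a dropout followed by a single linear projection from each D-dimensional patch embedding back to a length-P patch; the exact head used by MOMENT may differ.

```python
import torch.nn as nn

class ReconstructionHead(nn.Module):
    """Projects each D-dimensional patch embedding back to a patch of P time steps."""
    def __init__(self, d_model: int = 768, patch_len: int = 8, dropout: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.proj = nn.Linear(d_model, patch_len)

    def forward(self, patch_embeddings):                  # (batch, N, D)
        return self.proj(self.dropout(patch_embeddings))  # (batch, N, P) reconstructed patches
```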
3.3. Pre-training using Masked Time series Modeling
We pre-train MOMENT using the masked time series modeling task. Fig. 3 presents an overview of our pre-training procedure. During training, we first mask a small number of patches uniformly at random by replacing their patch embeddings with a learnable mask embedding [MASK]. The corrupted time series patches are then fed into the transformer encoder to learn patch representations, which are used to reconstruct the original time series using a lightweight reconstruction head. The pre-training objective is to minimize the masked reconstruction error, i.e., the mean squared error between the ground truth and the prediction, averaged over the masked patches.
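One training step of this objective can be sketched as follows. This is a simplified illustration: embed, encoder, and head stand for the patch embedding layer, transformer encoder, and reconstruction head, and the masking and loss computation follow the description above.

```python
import torch

def masked_pretraining_step(patches, embed, encoder, head, mask_embedding, mask_ratio=0.3):
    """patches: (B, N, P) ground-truth patches; mask_embedding: learnable D-dim [MASK] vector.
    Returns the masked reconstruction (MSE) loss."""
    B, N, _ = patches.shape
    x = embed(patches)                                     # (B, N, D) patch embeddings
    # Mask `mask_ratio` of the patches uniformly at random, per sample.
    n_masked = int(mask_ratio * N)
    scores = torch.rand(B, N, device=patches.device)
    mask = torch.zeros(B, N, dtype=torch.bool, device=patches.device)
    mask.scatter_(1, scores.topk(n_masked, dim=1).indices, True)
    x = torch.where(mask.unsqueeze(-1), mask_embedding.view(1, 1, -1), x)
    recon = head(encoder(x))                               # (B, N, P) reconstructed patches
    return ((recon - patches) ** 2)[mask].mean()           # MSE averaged over masked patches only
```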
Pre-training Setup. We pre-train three different sizes of MOMENT, roughly corresponding to the sizes of the encoders in T5-Small, Base, and Large. Specifically, the Base (Small, Large) model uses a 12 (6, 24) layer Transformer with hidden dimensions of size D = 768 (512, 1024), 12 (8, 16) attention heads, and feed-forward networks of size 3072 (2048, 4096), resulting in approximately 125 (40, 385) million parameters. All weights are randomly initialized before pre-training. All models take an input time series of length T = 512, breaking it into N = 64 disjoint patches of length P = 8. We mask 30% of the patches uniformly at random during pre-training.
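For reference, the encoder hyperparameters stated above can be summarized as follows; the dictionary layout is ours, while the numbers are those given in the text.

```python
# Encoder hyperparameters for the three MOMENT sizes (approximate parameter counts in comments).
MOMENT_CONFIGS = {
    "small": dict(num_layers=6,  d_model=512,  num_heads=8,  d_ff=2048),   # ~40M parameters
    "base":  dict(num_layers=12, d_model=768,  num_heads=12, d_ff=3072),   # ~125M parameters
    "large": dict(num_layers=24, d_model=1024, num_heads=16, d_ff=4096),   # ~385M parameters
}

SEQ_LEN, PATCH_LEN, NUM_PATCHES, MASK_RATIO = 512, 8, 64, 0.30
```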
3.4. Fine-tuning on Downstream Tasks
MOMENT can be seamlessly used for multiple time series analysis tasks. In this work, we consider 5 practical time series analysis tasks as examples, namely: long- and short-horizon forecasting, classification, anomaly detection, and imputation. For forecasting tasks with horizon H, we replace the reconstruction head with a forecasting head, which first flattens all the N D-dimensional patch embeddings into an N × D dimensional vector, and then projects it into an H-dimensional time series via a linear projection layer. For all other tasks, we retain the reconstruction head. We provide detailed descriptions of each task and MOMENT’s configuration in App. E.
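A sketch of such a forecasting head, which flattens the N patch embeddings and projects the resulting N × D vector to an H-step forecast, might look as follows; the dropout and default values are assumptions rather than MOMENT's exact configuration.

```python
import torch.nn as nn

class ForecastingHead(nn.Module):
    """Flattens all N patch embeddings and linearly projects them to an H-step forecast."""
    def __init__(self, num_patches: int = 64, d_model: int = 768, horizon: int = 96, dropout: float = 0.1):
        super().__init__()
        self.flatten = nn.Flatten(start_dim=1)   # (B, N, D) -> (B, N * D)
        self.dropout = nn.Dropout(dropout)
        self.proj = nn.Linear(num_patches * d_model, horizon)

    def forward(self, patch_embeddings):         # (B, N, D)
        return self.proj(self.dropout(self.flatten(patch_embeddings)))   # (B, H)
```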
Authors:
(1) Mononito Goswami, Auton Lab, Robotics Institute, Carnegie Mellon University, Pittsburgh, USA ([email protected]);
(2) Konrad Szafer, Auton Lab, Robotics Institute, Carnegie Mellon University, Pittsburgh, USA, with equal contribution, order decided using a random generator;
(3) Arjun Choudhry, Auton Lab, Robotics Institute, Carnegie Mellon University, Pittsburgh, USA, with equal contribution, order decided using a random generator;
(4) Yifu Cai, Auton Lab, Robotics Institute, Carnegie Mellon University, Pittsburgh, USA;
(5) Shuo Li, University of Pennsylvania, Philadelphia, USA;
(6) Artur Dubrawski, Auton Lab, Robotics Institute, Carnegie Mellon University, Pittsburgh, USA.
[2] We found a large majority of classification datasets to have time series shorter than 512. Besides, a look-back window of length 512 was found to be sufficient for accurate long-horizon forecasting (Nie et al., 2023).