How Do You Train an AI to Understand Time? With a Giant Pile of Data. | HackerNoon


Table of Links

Abstract and 1. Introduction

2. Related Work

3. Methodology

4. Experimental Setup and Results

5. Conclusion and Future Work

Acknowledgments

Reproducibility Statement

Impact Statement and References

3. Methodology

We first collect a large number of public time series data into the Time Series Pile and then use it to pre-train a transformer model on the masked time series prediction task. We discuss each of these steps in the following sections.

3.1. The Time Series Pile

Unlike natural language processing and computer vision, where large-scale datasets such as The Pile (Gao et al., 2020) and ImageNet-1K (Russakovsky et al., 2015) are easily available for pre-training, public time series datasets are much smaller, scattered, and largely task-specific (Ma et al., 2023; Zhou et al., 2023; Gruver et al., 2023). To bridge this gap, we collate multiple time series from 4 task-specific, widely-used public repositories, resulting in a large number of time series spanning diverse domains and time series characteristics such as lengths, amplitudes, and temporal resolutions. We call this collection the Time Series Pile.

Informer long-horizon forecasting datasets (Zhou et al., 2021) is a collection of 9 datasets that are widely used to evaluate long-horizon forecasting performance (Wu et al., 2023; Nie et al., 2023; Challu et al., 2023): 2 hourly and minutely subsets of the Electricity Transformer Temperature (ETT) (Zhou et al., 2021), Electricity (Trindade, 2015), Traffic (California Department of Transportation, 2024), Weather (Max Planck Institute for Biogeochemistry, 2024), Influenza-like Illness (ILI) (Centers for Disease Control and Prevention, 2024), and Exchange-rate (Lai et al., 2018).

Monash time series forecasting archive (Godahewa et al., 2021) is a collection of 58 publicly available short-horizon forecasting datasets with a total of over 100K time series, spanning a variety of domains and temporal resolutions.

UCR/UEA classification archive (Dau et al., 2018) comprises 159 time series datasets which are frequently used to benchmark classification algorithms (Ismail Fawaz et al., 2019). These datasets, belonging to seven different categories (Image Outline, Sensor Readings, Motion Capture, Spectrographs, ECG, Electric Devices, and Simulated Data), vary substantially in terms of the number of classes and the size of the training set.

TSB-UAD anomaly benchmark (Paparrizos et al., 2022b) is a recent collection of 1980 univariate time series with labeled anomalies from 18 anomaly detection datasets proposed over the past decade. This collection includes both synthetic and real-world time series originating from a wide range of sources such as the human body, spaceships, the environment, and web servers.

Minimizing data contamination using careful train-test splitting. We carefully split each dataset into disjoint training, validation, and test splits, based on the splits specified by the data creators. When these splits are not available, we randomly sample 60% of the data for training, 10% for validation, and 30% for testing. Long-horizon forecasting and anomaly detection datasets typically consist of long time series, which are split horizontally as shown in Fig. 2. Conversely, short-horizon forecasting and classification datasets often contain multiple short time series; for these datasets, each complete time series is assigned entirely to the training, validation, or test split. We use the same random seed, set to 13, throughout our experiments, from pre-training to downstream evaluation, thus ensuring that MOMENT only observes the training splits of datasets during pre-training.
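To make the two splitting strategies concrete, here is a minimal NumPy sketch. The function names and defaults are illustrative; only the 60/10/30 ratios, the horizontal-vs-whole-series distinction, and the fixed seed of 13 come from the text.

```python
import numpy as np

SEED = 13  # fixed random seed used from pre-training through downstream evaluation

def split_long_series(series: np.ndarray, train: float = 0.6, val: float = 0.1):
    """Horizontal split: one long series is cut into contiguous train/val/test segments."""
    n = len(series)
    train_end = int(n * train)
    val_end = int(n * (train + val))
    return series[:train_end], series[train_end:val_end], series[val_end:]

def split_series_collection(collection: list, train: float = 0.6, val: float = 0.1, seed: int = SEED):
    """Whole-series split: each short series is assigned entirely to train, val, or test."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(collection))
    train_end = int(len(collection) * train)
    val_end = int(len(collection) * (train + val))
    pick = lambda ids: [collection[i] for i in ids]
    return pick(idx[:train_end]), pick(idx[train_end:val_end]), pick(idx[val_end:])
```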

3.2. Model Architecture

Figure 3. Overview of MOMENT. A time series is broken into disjoint fixed-length sub-sequences called patches, and each patch is mapped into a D-dimensional patch embedding. During pre-training, we mask patches uniformly at random by replacing their patch embeddings using a special mask embedding [MASK]. The goal of pre-training is to learn patch embeddings which can be used to reconstruct the input time series using a light-weight reconstruction head.
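As a rough illustration of the patching step in Figure 3, the sketch below maps a fixed-length univariate series to a sequence of D-dimensional patch embeddings via a single linear projection. It assumes PyTorch and that the input length is a multiple of the patch length; the class name is illustrative, not MOMENT's actual implementation.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Break a univariate series into disjoint patches and project each to D dimensions."""
    def __init__(self, patch_len: int = 8, d_model: int = 768):
        super().__init__()
        self.patch_len = patch_len
        self.proj = nn.Linear(patch_len, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len), with seq_len divisible by patch_len
        batch, seq_len = x.shape
        patches = x.view(batch, seq_len // self.patch_len, self.patch_len)  # (B, N, P)
        return self.proj(patches)                                           # (B, N, D)
```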

Our transformer encoder retains the modifications proposed by Raffel et al. (2020) to the original Transformer (Vaswani et al., 2017). Specifically, we remove the additive bias from the Layer Norm (Ba et al., 2016), place it before the residual skip connections (He et al., 2016), and use the relative positional embedding scheme (Shaw et al., 2018). Below we summarize the intuition behind some of our key design decisions.
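For intuition, here is a minimal PyTorch sketch of an encoder block with these modifications: a bias-free Layer Norm (T5-style, which also drops mean subtraction) applied before each residual connection. The relative positional embedding scheme is omitted for brevity, and the module names are illustrative rather than MOMENT's actual code.

```python
import torch
import torch.nn as nn

class T5LayerNorm(nn.Module):
    """LayerNorm without an additive bias (and, as in T5, without mean subtraction)."""
    def __init__(self, d_model: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d_model))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        variance = x.pow(2).mean(dim=-1, keepdim=True)
        return self.weight * x * torch.rsqrt(variance + self.eps)

class PreNormBlock(nn.Module):
    """Pre-norm residual block: the norm is applied before each sub-layer, not after."""
    def __init__(self, d_model: int, n_heads: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        self.norm1 = T5LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.norm2 = T5LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # self-attention sub-layer
        x = x + self.ff(self.norm2(x))                      # feed-forward sub-layer
        return x
```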

Handling varying time series characteristics. Time series vary in length, number of channels, amplitudes, and temporal resolutions. We address variable length by restricting MOMENT’s input to a univariate time series of a fixed length T = 512. As is common practice, we sub-sample longer time series, and pad shorter ones with zeros on the left[2]. Moreover, segmenting time series into patches quadratically reduces MOMENT’s memory footprint and computational complexity, and linearly increases the length of time series it can take as input. We handle multi-variate time series by independently operating on each channel along the batch dimension. Like recent studies (Zhou et al., 2023; Nie et al., 2023), we found that modeling each channel independently is an effective strategy for modeling multivariate time series. Finally, re-scaling and centering time series using reversible instance normalization enables MOMENT to model time series with significantly different temporal distributions (Kim et al., 2022). We did not explicitly model the temporal resolution of time series, since this information is often unavailable outside of time series forecasting datasets.
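A minimal sketch of this input pipeline, assuming PyTorch: a fixed length T = 512, left zero-padding of shorter series, channel-independent processing by folding channels into the batch dimension, and per-instance normalization whose statistics can later be used to de-normalize outputs. Truncating longer series to their most recent points is used here as a simple stand-in for sub-sampling, and the function name is illustrative.

```python
import torch

def prepare_input(x: torch.Tensor, seq_len: int = 512, eps: float = 1e-5):
    """Fix the length, flatten channels into the batch, and apply instance normalization."""
    # x: (batch, n_channels, length); each channel is treated as an independent univariate series
    batch, n_channels, length = x.shape
    x = x.reshape(batch * n_channels, length)

    if length > seq_len:
        x = x[:, -seq_len:]                      # stand-in for sub-sampling longer series
    elif length < seq_len:
        pad = torch.zeros(x.shape[0], seq_len - length, dtype=x.dtype, device=x.device)
        x = torch.cat([pad, x], dim=1)           # zero-pad on the left

    # Instance (per-series) statistics; in practice padded positions would be excluded via a mask.
    mean = x.mean(dim=1, keepdim=True)
    std = x.std(dim=1, keepdim=True) + eps
    x_norm = (x - mean) / std
    return x_norm, mean, std                     # mean/std are reused to de-normalize predictions
```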

Intentionally simple encoder. Closely following the design of transformers in the language domain allows us to leverage their scalable and efficient implementations (e.g., gradient checkpointing, mixed precision training).

Light-weight prediction head. We use a lightweight prediction head instead of a decoder of the same size as the encoder. This enables task-specific fine-tuning with only a limited number of trainable parameters, while keeping the bulk of the parameters and the high-level features learned by the encoder intact.

3.3. Pre-training using Masked Time Series Modeling

We pre-train MOMENT using the masked time series modeling task. Fig. 3 presents an overview of our pre-training procedure. During training, we first mask a small number of patches uniformly at random by replacing their patch embeddings with a learnable mask embedding [MASK]. The corrupted time series patches are then fed into the transformer encoder to learn patch representations, which are used to reconstruct the original time series using a lightweight reconstruction head. The pre-training objective is to minimize the masked reconstruction error, i.e., the mean squared error between the ground truth and the prediction, averaged over the masked patches.
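The sketch below illustrates the two pieces of this objective in PyTorch: corrupting a batch of patch embeddings with a learnable [MASK] embedding, and averaging the squared reconstruction error over masked patches only. Shapes and function names are assumptions for illustration; the encoder and reconstruction head are omitted.

```python
import torch

def mask_patches(patch_emb: torch.Tensor, mask_ratio: float, mask_token: torch.Tensor):
    """Replace a random subset of patch embeddings with the learnable [MASK] embedding."""
    # patch_emb: (batch, n_patches, d_model); mask_token: learnable, shape (d_model,),
    # e.g. nn.Parameter(torch.zeros(d_model))
    batch, n_patches, _ = patch_emb.shape
    mask = torch.rand(batch, n_patches, device=patch_emb.device) < mask_ratio  # True = masked
    corrupted = torch.where(mask.unsqueeze(-1), mask_token, patch_emb)
    return corrupted, mask

def masked_reconstruction_loss(pred: torch.Tensor, target: torch.Tensor, mask: torch.Tensor):
    """Mean squared error between prediction and ground truth, averaged over masked patches only."""
    # pred, target: (batch, n_patches, patch_len); mask: (batch, n_patches), True where masked
    per_patch = (pred - target).pow(2).mean(dim=-1)
    return (per_patch * mask).sum() / mask.sum().clamp(min=1)
```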

Table 1. Experimental benchmark. We evaluate MOMENT on 5 time series analysis tasks with an emphasis on limited memory, compute, and supervision settings.

Pre-training Setup. We pre-train three different sizes of MOMENT, roughly corresponding to the sizes of encoders in T5-Small, Base, and Large. Specifically, the Base (Small, Large) model uses a 12 (6, 24) layer Transformer with hidden dimensions of size D = 768 (512, 1024), 12 (8, 16) attention heads, and feed-forward networks of size 3072 (2048, 4096), resulting in approximately 125 (40, 385) million parameters. All weights are randomly initialized before pre-training. All models take an input time series of length T = 512, breaking it into N = 64 disjoint patches of length P = 8. We mask 30% of the patches uniformly at random during pre-training.
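The three sizes can be summarized in a small configuration object; the values below simply restate the numbers given above, and the class and variable names are illustrative only.

```python
from dataclasses import dataclass

@dataclass
class MomentConfig:
    """Rough configuration of the three MOMENT sizes described above (names are illustrative)."""
    n_layers: int
    d_model: int
    n_heads: int
    d_ff: int
    seq_len: int = 512
    patch_len: int = 8        # => 64 disjoint patches per input series
    mask_ratio: float = 0.30  # fraction of patches masked during pre-training

MOMENT_SMALL = MomentConfig(n_layers=6,  d_model=512,  n_heads=8,  d_ff=2048)  # ~40M parameters
MOMENT_BASE  = MomentConfig(n_layers=12, d_model=768,  n_heads=12, d_ff=3072)  # ~125M parameters
MOMENT_LARGE = MomentConfig(n_layers=24, d_model=1024, n_heads=16, d_ff=4096)  # ~385M parameters
```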

3.4. Fine-tuning on Downstream Tasks

MOMENT can be seamlessly used for multiple time series analysis tasks. In this work, we consider 5 practical time series analysis tasks as examples, namely: long- and short-horizon forecasting, classification, anomaly detection, and imputation. For forecasting tasks with horizon H, we replace the reconstruction head with a forecasting head, which first flattens all the N D-dimensional patch embeddings into an N × D dimensional vector, and then projects it into an H-dimensional time series via a linear projection layer. For all other tasks, we retain the reconstruction head. We provide detailed descriptions of each task and MOMENT's configuration in App. E.
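A minimal sketch of such a forecasting head in PyTorch: it flattens the N patch embeddings into a single N × D vector and applies one linear projection to produce an H-step forecast. The class name and the example horizon are illustrative.

```python
import torch
import torch.nn as nn

class ForecastingHead(nn.Module):
    """Flatten the N patch embeddings into an N*D vector and project it to an H-step forecast."""
    def __init__(self, n_patches: int, d_model: int, horizon: int):
        super().__init__()
        self.flatten = nn.Flatten(start_dim=1)               # (B, N, D) -> (B, N*D)
        self.proj = nn.Linear(n_patches * d_model, horizon)  # (B, N*D) -> (B, H)

    def forward(self, patch_emb: torch.Tensor) -> torch.Tensor:
        return self.proj(self.flatten(patch_emb))

# e.g., for the Base model with N = 64 patches, D = 768, and a horizon of H = 96:
# head = ForecastingHead(n_patches=64, d_model=768, horizon=96)
```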

Authors:

(1) Mononito Goswami, Auton Lab, Robotics Institute, Carnegie Mellon University, Pittsburgh, USA ([email protected]);

(2) Konrad Szafer, Auton Lab, Robotics Institute, Carnegie Mellon University, Pittsburgh, USA, with equal contribution, order decided using a random generator;

(3) Arjun Choudhry, Auton Lab, Robotics Institute, Carnegie Mellon University, Pittsburgh, USA, with equal contribution, order decided using a random generator;

(4) Yifu Cai, Auton Lab, Robotics Institute, Carnegie Mellon University, Pittsburgh, USA;

(5) Shuo Li, University of Pennsylvania, Philadelphia, USA;

(6) Artur Dubrawski, Auton Lab, Robotics Institute, Carnegie Mellon University, Pittsburgh, USA.


[2] We found a large majority of classification datasets to have time series shorter than 512. Besides, a look-back window of length 512 was found to be sufficient for accurate long-horizon forecasting (Nie et al., 2023).

