Forecast evaluation leaderboard

This space hosts evaluation results for time series forecasting models.

The results are obtained using fev, a lightweight library for evaluating time series forecasting models.

Chronos Benchmark II results

This tab contains results for various forecasting models on the 27 datasets used in Benchmark II of the publication Chronos: Learning the Language of Time Series.

These datasets were used for zero-shot evaluation of Chronos models (i.e., Chronos models were not trained on them), but some of the other models did include some of these datasets in their training corpus.

Each table contains the following information:

  • Average relative error: Geometric mean across tasks of each model's relative error, where the relative error on a task is computed as model_error / baseline_error (see the sketch below this list).
  • Average rank: Arithmetic mean of the ranks achieved by the model across tasks.
  • Median inference time (s): Median across tasks of the time, in seconds, required to make predictions for the entire dataset.
  • Training corpus overlap (%): Percentage of the benchmark datasets that were included in the model's training corpus. Zero-shot models are highlighted in green.

Lower values are better for all of the above metrics.
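
For concreteness, here is a minimal sketch of how the two aggregate scores above can be computed from a table of per-task errors. The long-format layout with task, model, and error columns and the model named "baseline" are illustrative assumptions, not the exact code used by fev.

```python
import numpy as np
import pandas as pd

# Hypothetical per-task errors (e.g., WQL) for three models on two tasks.
errors = pd.DataFrame({
    "task":  ["t1", "t1", "t1", "t2", "t2", "t2"],
    "model": ["chronos", "baseline", "other", "chronos", "baseline", "other"],
    "error": [0.08, 0.10, 0.12, 0.20, 0.25, 0.22],
})
pivot = errors.pivot(index="task", columns="model", values="error")

# Average relative error: geometric mean over tasks of model_error / baseline_error.
relative = pivot.div(pivot["baseline"], axis=0)
avg_relative_error = np.exp(np.log(relative).mean())

# Average rank: arithmetic mean over tasks of each model's rank (1 = lowest error).
avg_rank = pivot.rank(axis=1).mean()

print(avg_relative_error)  # lower is better
print(avg_rank)            # lower is better
```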

Task definitions and the detailed results are available on GitHub. More information about the datasets is available in Table 3 of the paper.

Forecast accuracy measured by Weighted Quantile Loss (WQL).

[Leaderboard table with one row per model and columns: model_name, Average relative error, Average rank, Median inference time (s), Training corpus overlap (%).]
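
For reference, here is a minimal sketch of how the weighted quantile loss for a single series could be computed, following the common definition (pinball losses at the nine quantile levels 0.1, …, 0.9, normalized by the sum of absolute target values and averaged over the levels). The function name, array layout, and example values are illustrative assumptions, not fev's API.

```python
import numpy as np

def weighted_quantile_loss(y_true, q_pred, quantile_levels):
    """Average over quantile levels of the pinball loss summed over time,
    normalized by the sum of absolute target values.

    y_true: shape (num_timesteps,)
    q_pred: shape (num_quantiles, num_timesteps), one row per quantile level
    """
    y_true = np.asarray(y_true, dtype=float)
    q_pred = np.asarray(q_pred, dtype=float)
    q = np.asarray(quantile_levels, dtype=float)[:, None]

    diff = y_true[None, :] - q_pred
    # Pinball (quantile) loss, scaled by 2 as in the usual WQL definition.
    pinball = 2.0 * np.maximum(q * diff, (q - 1.0) * diff)

    per_level = pinball.sum(axis=1) / np.abs(y_true).sum()
    return per_level.mean()

# Hypothetical example: 9 quantile levels, constant-over-time quantile forecasts.
levels = np.arange(0.1, 1.0, 0.1)
y = np.array([10.0, 12.0, 11.0, 13.0])
preds = np.tile(np.linspace(9.0, 14.0, num=len(levels))[:, None], (1, len(y)))
print(weighted_quantile_loss(y, preds, levels))
```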