A benchmark dataset for emulating atmospheric radiative transfer in weather and climate models with machine learning (NeurIPS 2021 Datasets and Benchmarks Track)


ClimART - A Benchmark Dataset for Emulating Atmospheric Radiative Transfer in Weather and Climate Models


Official PyTorch Implementation

Using deep learning to optimise radiative transfer calculations.

Preliminary paper to appear at NeurIPS 2021 Datasets Track: https://openreview.net/forum?id=FZBtIpEAb5J

Abstract: Numerical simulations of Earth's weather and climate require substantial amounts of computation. This has led to a growing interest in replacing subroutines that explicitly compute physical processes with approximate machine learning (ML) methods that are fast at inference time. Within weather and climate models, atmospheric radiative transfer (RT) calculations are especially expensive. This has made them a popular target for neural network-based emulators. However, prior work is hard to compare due to the lack of a comprehensive dataset and standardized best practices for ML benchmarking. To fill this gap, we build a large dataset, ClimART, with more than 10 million samples from present, pre-industrial, and future climate conditions, based on the Canadian Earth System Model. ClimART poses several methodological challenges for the ML community, such as multiple out-of-distribution test sets, underlying domain physics, and a trade-off between accuracy and inference speed. We also present several novel baselines that indicate shortcomings of datasets and network architectures used in prior work.

Contact: Venkatesh Ramesh (venka97 at gmail) or Salva Rühling Cachay (salvaruehling at gmail).

Overview:

  • climart/: Package with the main code, baselines and ML training logic.
  • notebooks/: Notebooks for visualization of data.
  • analysis/: Scripts to create visualization of the results (requires logging).
  • scripts/: Scripts to train and evaluate models, and to download the whole ClimART dataset.

Getting Started

Requirements

  • Linux and Windows are supported, but we recommend Linux for performance and compatibility reasons.
  • An NVIDIA GPU with at least 8 GB of memory and a system with at least 12 GB of RAM. More RAM is required when training with the --load_train_into_mem option, which allows for faster training. All testing and development was done on NVIDIA V100 GPUs.
  • 64-bit Python >=3.7 and PyTorch >=1.8.1. See https://pytorch.org/ for PyTorch install instructions.
  • The Python libraries listed in the ``env.yml`` file (requires miniconda/conda; the setup commands are given in the next section).

Downloading the ClimART Dataset

By default, only a subset of ClimART is downloaded. To download the train/val/test years you want, change the loop in ``data_download.sh`` accordingly. To download the whole ClimART dataset, simply run

bash scripts/download_climart_full.sh 

conda env create -f env.yml   # create a new environment with all dependencies
conda activate climart  # activate the environment called 'climart'
bash data_download.sh  # download the dataset (or a subset of it, see above)
# For one of {CNN, GraphNet, GCN, MLP}, run the model with its lowercase name using the following command:
bash scripts/train_<model-name>.sh

Dataset Structure

To avoid storage redundancy, we store one single input array for both pristine- and clear-sky conditions. The dimensions of ClimART’s input arrays are:

  • layers: (N, 49, Dlay)
  • levels: (N, 50, 4)
  • globals: (N, 82)

where N is the data dimension (i.e. the number of examples of a specific year or, during training, of a batch), and 49 and 50 are the numbers of layers and levels in a column, respectively. Dlay, 4, and 82 are the numbers of features/channels for layers, levels, and globals, respectively.

For pristine-sky conditions Dlay = 14, while for clear-sky conditions Dlay = 45, since the latter include extra aerosol-related variables. The array for pristine-sky conditions can be obtained by slicing the first 14 features out of the stored array, e.g.: pristine_array = layers_array[:, :, :14]
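As a minimal sketch of the layout described above, the snippet below builds random arrays with the stated shapes and selects the layer features for each condition. The arrays are placeholders rather than real ClimART data, and the helper select_layers is a hypothetical name that is not part of the climart package.

```python
import numpy as np

# Placeholder batch with the shapes stated above (random values, not real data).
N = 8
layers_array = np.random.rand(N, 49, 45).astype(np.float32)   # clear-sky layer features
levels_array = np.random.rand(N, 50, 4).astype(np.float32)
globals_array = np.random.rand(N, 82).astype(np.float32)

PRISTINE_LAYER_FEATURES = 14  # the first 14 layer features, per the text above


def select_layers(layers, exp_type):
    """Hypothetical helper: return the layer features for the given condition."""
    if exp_type == "pristine":
        # Basic slicing returns a view, so no data is copied.
        return layers[:, :, :PRISTINE_LAYER_FEATURES]
    if exp_type == "clear_sky":
        return layers
    raise ValueError(f"Unknown exp_type: {exp_type!r}")


pristine_array = select_layers(layers_array, "pristine")
clear_sky_array = select_layers(layers_array, "clear_sky")
print(pristine_array.shape)   # (8, 49, 14)
print(clear_sky_array.shape)  # (8, 49, 45)
```

Because the pristine-sky array is a view into the stored array, both conditions can share a single copy of the data in memory.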

The complete list of variables in the dataset is as follows:

Variables List

Training Options

--exp_type: "pristine" or "clear_sky" for training on the respective atmospheric conditions.
--target_type: "longwave" (thermal) or "shortwave" (solar) for training on the respective radiation type targets.
--target_variable: "Fluxes" or "Heating-rate" for training on profiles of fluxes or heating rates.
--model: ML model architecture to select for training (MLP, GCN, GN, CNN)
--workers: The number of workers to use for dataloading/multi-processing.
--device: "cuda" or "cpu" to use GPUs or not.
--load_train_into_mem: Whether to load the training data into memory (can speed up training)
--load_val_into_mem: Whether to load the validation data into memory (can speed up training)
--lr: The learning rate to use for training.
--epochs: Number of epochs to train the model for.
--optim: The choice of optimizer to use (e.g. Adam)
--scheduler: The learning rate scheduler used for training (expdecay, reducelronplateau, steplr, cosine).
--weight_decay: Weight decay to use for the optimization process.
--batch_size: Batch size for training.
--act: Activation function (e.g. ReLU, GeLU, ...).
--hidden_dims: The hidden dimensionalities to use for the model (e.g. 128 128).
--dropout: Dropout rate to use for parameters.
--loss: Loss function to train the model with (MSE recommended).
--in_normalize: Select how to normalize the data (Z, min_max, None). Z-scaling is recommended.
--net_norm: Normalization scheme to use in the model (batch_norm, layer_norm, instance_norm)
--gradient_clipping: If "norm", the L2-norm of the parameters' gradients is clipped to the value of --clip. Otherwise no clipping.
--clip: Value to clip the gradient to while training.
--val_metric: Which metric to use for saving the 'best' model based on validation set. Default: "RMSE"
--gap: Use global average pooling in place of an MLP to get the output (CNN only).
--learn_edge_structure: If --model=='GCN': Whether to use a L-GCN (if set) with learnable adjacency matrix, or a GCN.
--train_years: The years to train on, given either as individual years joined by "+" (e.g. 1997+1991) or as a range (e.g. 1991-1996).
--validation_years: The years to validate on. Recommended: "2005" or "2005-06"
--test_ood_1991: Whether to load and test on OOD data from 1991 (Mt. Pinatubo; especially challenging for clear-sky conditions)
--test_ood_historic: Whether to load and test on historic/pre-industrial OOD data from 1850-52.
--test_ood_future: Whether to load and test on future OOD data from 2097-99 (under a changing climate/radiative forcing)
--wandb_mode: If "online", Weights&Biases logging. If "disabled" no logging.
--expID: A unique ID for the experiment if using logging.
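To illustrate the year formats accepted by --train_years and --validation_years, here is a small sketch of how such a string could be expanded into a list of years. The function parse_years is hypothetical and not taken from the repository; in particular, reading an abbreviated range such as "2005-06" as 2005 to 2006 is an assumption.

```python
def parse_years(spec):
    """Hypothetical parser: expand '1990+1999+2003' or '1991-1996' into a list of years."""
    years = []
    for part in spec.split("+"):
        part = part.strip()
        if "-" in part:
            start_str, end_str = part.split("-", 1)
            # Assumption: a shortened end year borrows the leading digits
            # of the start year, so '2005-06' is read as 2005 to 2006.
            if len(end_str) < len(start_str):
                end_str = start_str[: len(start_str) - len(end_str)] + end_str
            start, end = int(start_str), int(end_str)
            if end < start:
                raise ValueError(f"Empty year range: {part!r}")
            years.extend(range(start, end + 1))  # ranges are inclusive
        else:
            years.append(int(part))
    return years


print(parse_years("1990+1999+2003"))  # [1990, 1999, 2003]
print(parse_years("1991-1996"))       # [1991, 1992, 1993, 1994, 1995, 1996]
print(parse_years("2005-06"))         # [2005, 2006]
```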

Reproducing our Baselines

To reproduce our paper results (for seed = 7) you may run the following commands in a shell.

CNN

python main.py --model "CNN" --exp_type "pristine" --target_type "shortwave" --workers 6 --seed 7 \
  --batch_size 128 --lr 2e-4 --optim Adam --weight_decay 1e-6 --scheduler "expdecay" \
  --in_normalize "Z" --net_norm "none" --dropout 0.0 --act "GELU" --epochs 100 \
  --gap --gradient_clipping "norm" --clip 1.0 \
  --train_years "1990+1999+2003" --validation_years "2005" \
  --wandb_mode disabled

MLP

python main.py --model "MLP" --exp_type "pristine" --target_type "shortwave" --workers 6 --seed 7 \
  --batch_size 128 --lr 2e-4 --optim Adam --weight_decay 1e-6 --scheduler "expdecay" \
  --in_normalize "Z" --net_norm "layer_norm" --dropout 0.0 --act "GELU" --epochs 100 \
  --gradient_clipping "norm" --clip 1.0 --hidden_dims 512 256 256 \
  --train_years "1990+1999+2003" --validation_years "2005" \
  --wandb_mode disabled

GCN

python main.py --model "GCN+Readout" --exp_type "pristine" --target_type "shortwave" --workers 6 --seed 7 \
  --batch_size 128 --lr 2e-4 --optim Adam --weight_decay 1e-6 --scheduler "expdecay" \
  --in_normalize "Z" --net_norm "layer_norm" --dropout 0.0 --act "GELU" --epochs 100 \
  --preprocessing "mlp_projection" --projector_net_normalization "layer_norm" --graph_pooling "mean" \
  --residual --improved_self_loops \
  --gradient_clipping "norm" --clip 1.0 --hidden_dims 128 128 128 \
  --train_years "1990+1999+2003" --validation_years "2005" \
  --wandb_mode disabled
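All three commands above pass --gradient_clipping "norm" --clip 1.0. As a rough illustration of what clipping by L2-norm does, the NumPy sketch below rescales a set of gradient arrays so that their joint norm does not exceed a threshold. It is not the repository's code: the function name clip_by_global_norm and the eps term are assumptions made for this example. PyTorch provides torch.nn.utils.clip_grad_norm_ for the same purpose; whether the training code relies on it is not stated here.

```python
import numpy as np


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale gradient arrays so that their joint L2-norm is at most max_norm."""
    total_norm = float(np.sqrt(sum(float(np.sum(g ** 2)) for g in grads)))
    # Gradients are only ever scaled down; small gradients pass through unchanged.
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm


# Toy gradients with joint norm sqrt(3^2 + 4^2 + 12^2) = 13.
grads = [np.array([3.0, 4.0]), np.array([12.0])]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = float(np.sqrt(sum(float(np.sum(g ** 2)) for g in clipped)))
print(norm_before)  # 13.0
print(norm_after)   # just below 1.0
```

Note that the direction of the overall gradient is preserved; only its magnitude is capped.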

Logging

Logging is currently disabled by default. To log experiments with Weights & Biases (wandb), pass the argument --wandb_mode=online

Notebooks

The notebooks folder contains some Jupyter notebooks that we used for plotting, benchmarking, etc. You can go through them to visualize the results or benchmark the models.

License:

This work is made available under the Attribution 4.0 International (CC BY 4.0) license.

Development

This repository is currently under active development and you may encounter bugs with some functionality. Any feedback, extensions & suggestions are welcome!

Citation

If you find ClimART or this repository helpful, feel free to cite our publication:

@inproceedings{cachay2021climart,
    title={{ClimART}: A Benchmark Dataset for Emulating Atmospheric Radiative Transfer in Weather and Climate Models},
    author={Salva R{\"u}hling Cachay and Venkatesh Ramesh and Jason N. S. Cole and Howard Barker and David Rolnick},
    booktitle={Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
    year={2021},
    url={https://openreview.net/forum?id=FZBtIpEAb5J}
}