# High Performance Computing (HPC) and Dask
XRegrid is designed for high-performance regridding on both local workstations and large-scale HPC systems. This guide shows you how to leverage Dask to parallelize your regridding tasks.
## Parallelization Strategies
XRegrid uses Dask to parallelize two distinct phases:
- Parallel Weight Generation: Generating weights for large grids (e.g., a 3 km global grid) can take minutes on a single core. XRegrid can distribute this process across a Dask cluster.
- Parallel Weight Application: For large datasets, applying the regridding weights can be parallelized using Dask chunks.
## 1. Local Dask Cluster
On a local machine with multiple cores, you can use a `LocalCluster` to parallelize your work.
```python
import dask.distributed

from xregrid import Regridder

# ds_src / ds_tgt: the source and target grid datasets (xarray.Dataset)

# Start a local cluster (Client() with no arguments launches a LocalCluster)
client = dask.distributed.Client()

# Create a regridder with parallel weight generation
regridder = Regridder(
    ds_src, ds_tgt,
    method='bilinear',
    parallel=True,  # distribute weight generation across the cluster
)

# Apply the regridder to data chunked along time
ds_chunked = ds_src.chunk({'time': 10})
regridded = regridder(ds_chunked)
```
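By default, `Client()` sizes the local cluster from the machine's core count and memory. If you need tighter control, you can configure the cluster explicitly; a minimal sketch using standard `dask.distributed` options (the worker count and memory limit below are illustrative, not XRegrid defaults):

```python
from dask.distributed import Client, LocalCluster

# Illustrative sizing; tune these values to your workstation
cluster = LocalCluster(n_workers=4, threads_per_worker=1, memory_limit="8GiB")
client = Client(cluster)
```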
## 2. NOAA RDHPCS Systems (Hera, Jet, Gaea, Ursa)
XRegrid provides a dedicated helper for setting up SLURM clusters on NOAA's RDHPCS systems. It automatically detects your machine and provides reasonable defaults for each architecture.
```python
from distributed import Client

from xregrid import Regridder
from xregrid.utils import get_rdhpcs_cluster

# Automatically detect the machine (Hera, Jet, Gaea, or Ursa)
# and set up a SLURM cluster with defaults for that architecture
cluster = get_rdhpcs_cluster(account="your_slurm_account")

# Scale the cluster to use 4 SLURM jobs
cluster.scale(jobs=4)
client = Client(cluster)

# All regridding will now be distributed across the cluster
regridder = Regridder(ds_src, ds_tgt, parallel=True)
regridded = regridder(ds_src)
```
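If you need more control than the helper exposes, you can build the SLURM cluster yourself with dask_jobqueue. A rough sketch for Hera (whether `get_rdhpcs_cluster` wraps `SLURMCluster` internally is an assumption; the queue and memory values follow the table below, while the core count and walltime are placeholders you should set yourself):

```python
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(
    queue="hera",                 # default queue on Hera (see table below)
    account="your_slurm_account",
    cores=40,                     # placeholder: match the node's core count
    memory="160GB",               # per-node memory on Hera (see table below)
    walltime="01:00:00",          # placeholder: size to your workload
)
cluster.scale(jobs=4)
```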
### Supported Machines
| Machine | Name | Default Queue | Default Memory |
|---|---|---|---|
| Hera | `hera` | `hera` | 160 GB / node |
| Jet | `jet` | `batch` | 120 GB / node |
| Gaea | `gaea-c5`, `gaea-c6` | `batch` | 256/384 GB / node |
| Ursa | `ursa` | `u1-compute` | 384 GB / node |
## 3. Parallel Weight Generation
When `parallel=True` is passed to the `Regridder` constructor, XRegrid:

- Splits the target grid into multiple chunks.
- Submits weight generation tasks to different Dask workers.
- Assembles weights on the cluster: to protect the driver from Out-Of-Memory (OOM) crashes, the weights are concatenated on a worker and stored as a Dask Future.

The `compute` argument controls when assembly happens (see the sketch after this list):

- `compute=True` (default): triggers calculation and assembly immediately.
- `compute=False`: submits tasks but defers assembly until you call `regridder.compute()`.
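A minimal sketch of the deferred path (the `compute` keyword and `regridder.compute()` come from the description above; the rest mirrors the earlier examples):

```python
# Submit weight-generation tasks without assembling the result yet
regridder = Regridder(
    ds_src, ds_tgt,
    method='bilinear',
    parallel=True,
    compute=False,
)

# ... queue up other work on the cluster ...

# Assemble the weights when you are ready to use them
regridder.compute()
regridded = regridder(ds_src)
```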
## 4. Large-Scale Memory Management
When working with massive grids and datasets, follow these best practices:
### Use Weight Reuse
Always save your weights for large grids to avoid recomputing them.
```python
regridder = Regridder(
    ds_src, ds_tgt,
    parallel=True,
    reuse_weights=True,         # reuse saved weights instead of recomputing
    filename='3km_weights.nc',  # where the weights are stored
)
```
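On later runs with the same grids, this call should load `3km_weights.nc` rather than regenerate the weights: generate once, then reuse everywhere.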
### Chunk in Time, Not Space

For optimal performance, keep the spatial dimensions unchunked and chunk along the time or level dimension (see the sketch below the examples):

- Good: `{'time': 20, 'lat': 720, 'lon': 1440}`
- Bad: `{'time': 20, 'lat': 100, 'lon': 100}`
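In practice you can keep each spatial field whole by passing `-1` for the spatial chunk sizes; a small sketch (the dimension names assume a regular lat/lon grid):

```python
# -1 means "a single chunk spanning the whole dimension"
ds_chunked = ds_src.chunk({'time': 20, 'lat': -1, 'lon': -1})
regridded = regridder(ds_chunked)
```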
### Worker-Local Caching
XRegrid automatically caches weights on Dask workers. When regridding multiple time steps, the weight matrix is only sent to each worker once, significantly reducing network overhead.
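This mirrors a standard Dask pattern: scatter a large object to every worker once, then pass the resulting future to each task instead of re-sending the data. A self-contained sketch of the idea (`apply_weights` and the arrays here are hypothetical stand-ins, not XRegrid API):

```python
import numpy as np
from dask.distributed import Client

client = Client()

def apply_weights(chunk, weights):
    # Hypothetical stand-in for applying a weight matrix to one time chunk
    return weights @ chunk

weights = np.random.rand(100, 100)              # stand-in weight matrix
time_chunks = [np.random.rand(100) for _ in range(20)]

# Broadcast the weights to every worker once...
weights_future = client.scatter(weights, broadcast=True)
# ...then reference the future in each task; Dask resolves it worker-side
results = client.gather(client.map(apply_weights, time_chunks, weights=weights_future))
```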
## Troubleshooting Dask
- "Worker Memory Exceeded": Try reducing your chunk size in the time dimension.
- "Communication Timeout": For very large grids, increase the Dask communication timeout:
dask.config.set({"distributed.comm.timeouts.connect": "60s"}). - "Killed Worker": Ensure you are not chunking your spatial dimensions. ESMF regridding requires the full spatial domain to be available to each worker for a given time chunk.