PhysicsNeMo Curator is an accelerated ETL (extract, transform, load) toolkit for building AI-ready datasets across multiple scientific and engineering domains, including CAE, weather/climate, molecular dynamics, and more. Curator is a flexible, customizable package that provides the core pipeline components users need to build their own data processing pipelines.
Warning
This package is in beta and subject to extensive changes. There are no guarantees for API stability.
- Fluent pipeline API — chain Source → Filter → Sink with a single expression, then execute in parallel
- Lazy generator semantics — sources and filters yield items lazily; pipeline[i] processes only the i-th item
- Multiple domains — first-class support for unstructured meshes (physicsnemo.mesh.Mesh), gridded data arrays (xarray.DataArray), and atomic/molecular data (nvalchemi.data.AtomicData)
- Pluggable execution — sequential, thread pool, process pool, Loky, Dask, or Prefect backends
- Registry & CLI — all sources, filters, and sinks are discoverable via a global registry and optional interactive CLI
- Extensible — write custom sources, filters, and sinks with minimal boilerplate (guide)
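
The lazy, index-addressable behavior described in the feature list can be illustrated with a small toy model. This is only a sketch of the idea and not Curator's implementation: `ToyPipeline`, `load`, `double`, and `collect` are hypothetical names invented for this example, and the real classes may differ in names and signatures.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator, List


class ToyPipeline:
    """Hypothetical stand-in for a Source -> Filter -> Sink chain (not Curator's API)."""

    def __init__(self, n_items: int, load: Callable[[int], int]):
        self._n_items = n_items
        self._load = load  # invoked only when a specific item is requested
        self._filters: List[Callable[[Iterator[int]], Iterator[int]]] = []
        self._sink: Callable[[Iterator[int]], list] = list

    def filter(self, fn: Callable[[Iterator[int]], Iterator[int]]) -> "ToyPipeline":
        self._filters.append(fn)
        return self  # returning self is what makes the chained (fluent) style work

    def write(self, sink: Callable[[Iterator[int]], list]) -> "ToyPipeline":
        self._sink = sink
        return self

    def __len__(self) -> int:
        return self._n_items

    def __getitem__(self, i: int) -> list:
        def source() -> Iterator[int]:
            yield self._load(i)  # only item i is ever loaded

        stream = source()
        for fn in self._filters:
            stream = fn(stream)  # generators compose without running yet
        return self._sink(stream)  # the sink pulls items through the chain


loaded: List[int] = []  # records which indices were actually loaded


def load(i: int) -> int:
    loaded.append(i)
    return i * 10


def double(items: Iterator[int]) -> Iterator[int]:
    for item in items:
        yield item * 2


def collect(items: Iterator[int]) -> list:
    return list(items)


pipeline = ToyPipeline(5, load).filter(double).write(collect)

third = pipeline[3]  # processes index 3 and nothing else
loaded_after_single = list(loaded)

# Because each index is independent, a backend can map over indices in parallel.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(pipeline.__getitem__, range(len(pipeline))))
```

Keeping each index independent is what lets the execution backend be swapped: the same per-index call can be dispatched sequentially, to threads, or to processes without changing the pipeline definition.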
- Python >= 3.11
- OS: Linux x86_64
- Rust toolchain (for building the native extension from source)
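
Before building from source, it can help to confirm that the machine meets the requirements listed above. The following is a minimal preflight sketch using only the Python standard library. The function names are hypothetical, and treating `cargo` on the `PATH` as evidence of a Rust toolchain is an assumption of this example, not something the project prescribes.

```python
import platform
import shutil
import sys
from typing import List, Tuple


def find_problems(
    version: Tuple[int, int], system: str, machine: str, has_cargo: bool
) -> List[str]:
    """Compare an environment description against the listed requirements."""
    problems = []
    if version < (3, 11):
        problems.append(f"Python >= 3.11 required, found {version[0]}.{version[1]}")
    if system != "Linux" or machine != "x86_64":
        problems.append(f"Linux x86_64 required, found {system} {machine}")
    if not has_cargo:
        # Assumption: a missing `cargo` means no usable Rust toolchain.
        problems.append("Rust toolchain not found (no `cargo` on PATH)")
    return problems


def check_this_machine() -> List[str]:
    """Gather the current environment and run the checks against it."""
    return find_problems(
        version=(sys.version_info.major, sys.version_info.minor),
        system=platform.system(),
        machine=platform.machine(),
        has_cargo=shutil.which("cargo") is not None,
    )


for problem in check_this_machine():
    print(f"warning: {problem}")
```

The Rust check only matters when building the native extension from source, so a warning there can be ignored if a prebuilt package is being installed.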
git clone git@github.com:NVIDIA/physicsnemo-curator.git
cd physicsnemo-curator
# Install all dev dependencies and build the Rust extension
uv sync --group dev
uv run maturin develop
# (Optional) Install pre-commit hooks
uv run pre-commit install

Curate a simple global weather dataset:
# First install the data array dependency group
uv sync --extra da

from datetime import datetime, timedelta
from physicsnemo_curator.domains.da.filters.stats import DataArrayStatsFilter
from physicsnemo_curator.domains.da.sinks.zarr_writer import ZarrSink
from physicsnemo_curator.domains.da.sources.era5 import ERA5Source
from physicsnemo_curator.run import run_pipeline
# Hourly timestamps for one day
times = [datetime(2020, 1, 1) + timedelta(hours=h) for h in range(24)]
# Source → Filter → Sink
pipeline = (
    ERA5Source(times=times, variables=["u10m", "v10m", "t2m"], backend="arco")
    .filter(DataArrayStatsFilter(output="output/stats.zarr", dims=("time",)))
    .write(ZarrSink(output_path="output/dataset.zarr"))
)
# Execute in parallel
results = run_pipeline(pipeline, n_jobs=4, backend="process_pool")

Install domain-specific extras as needed:
# Mesh domain (CAE, CFD)
pip install 'physicsnemo-curator[mesh]'
# DataArray domain (weather/climate)
pip install 'physicsnemo-curator[da]'
# Atomic domain (molecular dynamics)
pip install 'physicsnemo-curator[atm]'
# Dashboard
pip install 'physicsnemo-curator[dashboard]'

PhysicsNeMo Curator includes the psnc command-line tool with an interactive full-screen pipeline wizard powered by Textual.

psnc

pip install 'physicsnemo-curator[dashboard]'
psnc dashboard pipeline.db

PhysicsNeMo Curator is an open-source project, and its success is rooted in community contributions. Thank you for contributing so others can build on your work.
For guidance, please refer to the contributing guidelines. See also:
- Extending / Customization — how to write custom sources, filters, and sinks
- Developer Guide — style conventions, benchmarking, and AI-assisted development
PhysicsNeMo Curator is part of NVIDIA's open-source Physics-ML ecosystem:
| Package | Description |
|---|---|
| PhysicsNeMo | Core framework for building, training, and fine-tuning physics-ML models |
| PhysicsNeMo CFD | Pretrained AI models for computational fluid dynamics |
| Earth-2 Studio | Pretrained AI models for weather and climate |
| ALCHEMI Toolkit | GPU-first framework for AI-driven atomic simulations |
| ALCHEMI Toolkit Ops | GPU-optimized primitives for neighbor lists, dispersion, and electrostatics |
- GitHub Discussions — new data formats, transformations, Physics-ML research
- GitHub Issues — bug reports, feature requests, installation issues
PhysicsNeMo Curator is provided under the Apache License 2.0. See LICENSE.txt for the full license text.