Generate Virtual Zarr from CMIP6 NetCDF files
Run this notebook
You can launch this notebook in VEDA JupyterHub by clicking the link below.
Launch in VEDA JupyterHub (requires access)
Learn more
Inside the Hub
This notebook was written on a VEDA JupyterHub instance using the default modified Pangeo image which contains all necessary libraries.
See VEDA Analytics JupyterHub Access for information about how to gain access.
Outside the Hub
You are welcome to run this anywhere you like (note: you can alternatively run this on MAAP or locally); just make sure that you have a compatible environment and that the data is accessible, or get in contact with the VEDA team to enable access.
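If you are building your own environment, the cell below is a rough sketch of one way to install the libraries this notebook uses (the VEDA image already ships them; exact packages and version pins are up to you):
!pip install virtualizarr icechunk obstore xarray dask h5netcdf fsspec s3fs matplotlib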
Approach
This notebook demonstrates how to create a virtual Icechunk store reference for the AWS Open Data Registry of NASA Earth Exchange Global Daily Downscaled Projections (NEX-GDDP-CMIP6) NetCDF files on S3. Because the NetCDF files are publicly available, this notebook should be runnable in any environment with the imported libraries, up until the last step, where the Icechunk store is written to the veda-data-store-staging S3 bucket, as that is a protected bucket.
To see how to publish a virtual Icechunk store to a STAC collection, see the Publishing CMIP6 Virtual Zarr to STAC notebook.
Step 1: Setup
Import the necessary libraries and define which CMIP6 variable and model we will create references for.
Zarr can emit a lot of warnings about Numcodecs not yet being included in the Zarr version 3 specification – let’s suppress those.
import icechunk
import obstore
import fnmatch
import warnings

from virtualizarr import open_virtual_dataset, open_virtual_mfdataset
from virtualizarr.parsers import HDFParser
from virtualizarr.registry import ObjectStoreRegistry
import xarray as xr

warnings.filterwarnings(
    "ignore",
    message="Numcodecs codecs are not in the Zarr version 3 specification*",
    category=UserWarning,
)
Specify the CMIP model and variable to use. Here we are using near-surface air temperature from the GISS-E2-1-G GCM.
Note: We are only using the historical data in this example.
More years of data are available from multiple Shared Socio-Economic Pathways (SSPs) in the s3://nex-gddp-cmip6 bucket.
= "GISS-E2-1-G"
model = "tas"
variable
= "nex-gddp-cmip6"
s3_bucket = "us-west-2"
s3_region = True # set to false for a private bucket
s3_anonymous = "NEX-GDDP-CMIP6/GISS-E2-1-G/historical/"
s3_prefix = f"s3://{s3_bucket}/{s3_prefix}"
s3_path = f"{s3_prefix}r1i1p1*/{variable}/*" file_pattern
Step 2: Initiate file systems for reading
We can use Obstore’s obstore.store.from_url convenience method to create an ObjectStore that can fetch data from the specified URLs.
store = obstore.store.from_url(f"s3://{s3_bucket}/", region="us-west-2", skip_signature=True)
Step 3: Discover files from S3
List all available files for the model and then filter them to those that match the specified variable:
pages_of_files = obstore.list(store, prefix=s3_prefix)

all_files = []
async for files in pages_of_files:
    all_files += [f["path"] for f in files if fnmatch.fnmatch(f["path"], file_pattern)]
print(f"{len(all_files)} discovered matching s3://{s3_bucket}/{file_pattern}")
130 discovered matching s3://nex-gddp-cmip6/NEX-GDDP-CMIP6/GISS-E2-1-G/historical/r1i1p1*/tas/*
Step 4: Open a single file as a virtual dataset
Now, let’s create a virtual dataset by passing the URL, a parser instance, and an ObjectStoreRegistry instance to virtualizarr.open_virtual_dataset.
We’ll try it with a single file first to make sure it works and get a sense of how long it takes to scan the NetCDF.
%%time
vds = open_virtual_dataset(
    url=f"s3://{s3_bucket}/{all_files[0]}",
    parser=HDFParser(),
    registry=ObjectStoreRegistry({f"s3://{s3_bucket}/": store}),
)
print(vds)
<xarray.Dataset> Size: 1GB
Dimensions: (time: 365, lat: 600, lon: 1440)
Coordinates:
* time (time) object 3kB 1950-01-01 12:00:00 ... 1950-12-31 12:00:00
* lat (lat) float64 5kB -59.88 -59.62 -59.38 -59.12 ... 89.38 89.62 89.88
* lon (lon) float64 12kB 0.125 0.375 0.625 0.875 ... 359.4 359.6 359.9
Data variables:
tas (time, lat, lon) float32 1GB ManifestArray<shape=(365, 600, 1440...
Attributes: (12/23)
downscalingModel: BCSD
activity: NEX-GDDP-CMIP6
contact: Dr. Rama Nemani: rama.nemani@nasa.gov, Dr. Bridget...
Conventions: CF-1.7
creation_date: 2021-10-04T18:41:40.796912+00:00
frequency: day
... ...
history: 2021-10-04T18:41:40.796912+00:00: install global a...
disclaimer: This data is considered provisional and subject to...
external_variables: areacella
cmip6_source_id: GISS-E2-1-G
cmip6_institution_id: NASA-GISS
cmip6_license: CC-BY-SA 4.0
CPU times: user 4.29 s, sys: 339 ms, total: 4.62 s
Wall time: 6.2 s
Step 5: Open all the files as a virtual multifile dataset
Just like we can open one file as a virtual dataset using virtualizarr.open_virtual_dataset, we can open multiple data sources into a single virtual dataset using virtualizarr.open_virtual_mfdataset, similar to how xarray.open_mfdataset opens multiple data files as a single dataset.
Note this takes some time because VirtualiZarr needs to scan each file to extract its metadata, so expect it to take on the order of 130 × whatever the last cell took.
%%time
vds = open_virtual_mfdataset(
    [f"s3://{s3_bucket}/{path}" for path in all_files],
    parser=HDFParser(),
    registry=ObjectStoreRegistry({f"s3://{s3_bucket}/": store}),
)
print(vds)
<xarray.Dataset> Size: 82GB
Dimensions: (time: 23725, lat: 600, lon: 1440)
Coordinates:
* time (time) object 190kB 1950-01-01 12:00:00 ... 2014-12-31 12:00:00
* lat (lat) float64 5kB -59.88 -59.62 -59.38 -59.12 ... 89.38 89.62 89.88
* lon (lon) float64 12kB 0.125 0.375 0.625 0.875 ... 359.4 359.6 359.9
Data variables:
tas (time, lat, lon) float32 82GB ManifestArray<shape=(23725, 600, 1...
Attributes: (12/22)
downscalingModel: BCSD
activity: NEX-GDDP-CMIP6
Conventions: CF-1.7
frequency: day
institution: NASA Earth Exchange, NASA Ames Research Center, Mo...
variant_label: r1i1p1f2
... ...
cmip6_institution_id: NASA-GISS
cmip6_license: CC-BY-SA 4.0
contact: Dr. Bridget Thrasher: bridget@climateanalyticsgrou...
creation_date: Sat Nov 16 21:42:29 PST 2024
disclaimer: These data are considered provisional and subject ...
tracking_id: ba64ebba-3c17-4348-b463-7e6ae7d11769
CPU times: user 16.8 s, sys: 18.3 s, total: 35.1 s
Wall time: 5min 30s
The size shown represents the size of the original dataset; you can see the size of the virtual dataset itself using the vz accessor:
print(f"Original dataset size: {vds.nbytes / 1000**3:.2f} GB")
print(f"Virtual dataset size: {vds.vz.nbytes / 1000:.2f} kB")
Original dataset size: 81.99 GB
Virtual dataset size: 965.32 kB
The magic of virtual Zarr is that you can persist the virtual dataset to disk in a chunk reference format such as Icechunk or kerchunk, meaning that the work of constructing the single coherent dataset only needs to happen once. For subsequent data access, you can use xarray.open_zarr to open that virtual Zarr store, which on object storage is far faster than using xarray.open_mfdataset to open the original non-cloud-optimized files.
Step 6: Write to Icechunk
Let’s persist the virtual dataset using Icechunk. There are two options for storing the dataset: if you don’t have sufficient credentials, you can store it in memory; otherwise, you can store the virtual dataset in the veda-data-store-staging bucket on S3.
# Option 1: write to the protected veda-data-store-staging bucket on S3
icechunk_store = icechunk.s3_storage(
    bucket="veda-data-store-staging",
    region="us-west-2",
    prefix=f"cmip6-{model}-{variable}-historical-icechunk",
    from_env=True,
)

# Option 2: store the virtual dataset in memory (no credentials required)
icechunk_store = icechunk.in_memory_storage()
Did you pick where to write to?
Good! Let’s do it!
config = icechunk.RepositoryConfig.default()

config.set_virtual_chunk_container(
    icechunk.VirtualChunkContainer(
        url_prefix=s3_path,
        store=icechunk.s3_store(region=s3_region, anonymous=s3_anonymous),
    ),
)
if s3_anonymous:
    virtual_credentials = icechunk.containers_credentials(
        {s3_path: icechunk.s3_anonymous_credentials()}
    )
repo = icechunk.Repository.create(
    icechunk_store,
    config,
    authorize_virtual_chunk_access=virtual_credentials,
)

session = repo.writable_session("main")

vds.vz.to_icechunk(session.store)
session.commit("Create virtual store")
'SHSZN93MQCNF93SE7M30'
To make sure it worked, let’s take a look at the bucket:
!aws s3 ls s3://veda-data-store-staging/cmip6-GISS-E2-1-G-tas-historical-icechunk/
PRE manifests/
PRE refs/
PRE snapshots/
PRE transactions/
2025-09-18 16:23:48 447 config.yaml
What about kerchunk?
NOTE: We tend to prefer Icechunk because it is scalable, ensures ACID transactions, and has built-in version control, but you can also write virtual datasets to kerchunk. Here is how you can write a local kerchunk JSON file:
vds.vz.to_kerchunk(
    filepath=f"combined_CMIP6_daily_{model}_{variable}_kerchunk.json",
    format="json",
)
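For reference, here is a sketch of one common way to open that kerchunk JSON again with xarray, using an fsspec reference filesystem. This assumes fsspec and s3fs are installed and that the referenced NetCDF files remain publicly readable; the exact incantation can vary with your installed zarr version.
ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {
            # the local kerchunk JSON written above
            "fo": f"combined_CMIP6_daily_{model}_{variable}_kerchunk.json",
            # the referenced chunks live in the public NEX-GDDP-CMIP6 bucket
            "remote_protocol": "s3",
            "remote_options": {"anon": True},
        },
    },
)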
Final Step: Check your work
It is always a good idea to check your work when creating a new virtual dataset by comparing the virtual version to the native version. In this case, let’s open the Icechunk store on S3 so we can compare how long it takes to lazily open the virtual dataset versus the native files.
%%time
storage = icechunk.s3_storage(
    bucket="veda-data-store-staging",
    region="us-west-2",
    prefix=f"cmip6-{model}-{variable}-historical-icechunk",
    from_env=True,
)

config = icechunk.RepositoryConfig.default()

config.set_virtual_chunk_container(
    icechunk.VirtualChunkContainer(
        s3_path,
        icechunk.s3_store(region=s3_region),
    )
)

if s3_anonymous:
    virtual_credentials = icechunk.containers_credentials(
        {s3_path: icechunk.s3_anonymous_credentials()}
    )

repo = icechunk.Repository.open(
    storage=storage,
    config=config,
    authorize_virtual_chunk_access=virtual_credentials,
)
session = repo.readonly_session(branch="main")

observed = xr.open_zarr(session.store)
observed
CPU times: user 157 ms, sys: 0 ns, total: 157 ms
Wall time: 391 ms
<xarray.Dataset> Size: 82GB
Dimensions: (time: 23725, lat: 600, lon: 1440)
Coordinates:
* lat (lat) float64 5kB -59.88 -59.62 -59.38 -59.12 ... 89.38 89.62 89.88
* lon (lon) float64 12kB 0.125 0.375 0.625 0.875 ... 359.4 359.6 359.9
* time (time) object 190kB 1950-01-01 12:00:00 ... 2014-12-31 12:00:00
Data variables:
tas (time, lat, lon) float32 82GB dask.array<chunksize=(1, 600, 1440), meta=np.ndarray>
Attributes: (12/22)
downscalingModel: BCSD
activity: NEX-GDDP-CMIP6
Conventions: CF-1.7
frequency: day
institution: NASA Earth Exchange, NASA Ames Research Center, Mo...
variant_label: r1i1p1f2
... ...
cmip6_institution_id: NASA-GISS
cmip6_license: CC-BY-SA 4.0
contact: Dr. Bridget Thrasher: bridget@climateanalyticsgrou...
creation_date: Sat Nov 16 21:42:29 PST 2024
disclaimer: These data are considered provisional and subject ...
tracking_id: ba64ebba-3c17-4348-b463-7e6ae7d11769
Make sure you can access the underlying chunk data. Let’s time this as well so we can compare it to the native version.
%%time
data_slice = observed.tas.sel(time="2010", lat=slice(39, 41), lon=slice(284, 286))
data_slice.mean(dim="time").compute().plot()
CPU times: user 7.15 s, sys: 1.45 s, total: 8.6 s
Wall time: 7.45 s
Compare that with the traditional method of opening the NetCDF files directly. We’ll tweak the fsspec configuration to fetch larger blocks at a time.
import fsspec
fs = fsspec.filesystem("s3", anon=True)

fsspec_kwargs = {
    "cache_type": "blockcache",  # block cache stores fixed-size blocks and evicts them with an LRU strategy
    "block_size": 8 * 1024 * 1024,  # size in bytes per block; depends on the file size, but the recommendation is in the MBs
}
Now we can lazily open all the NetCDF files.
%%time
expected = xr.open_mfdataset(
    [fs.open(f"s3://{s3_bucket}/{path}", **fsspec_kwargs) for path in all_files],
    engine="h5netcdf",
)
expected
CPU times: user 10.6 s, sys: 2.5 s, total: 13.1 s
Wall time: 57.3 s
<xarray.Dataset> Size: 82GB
Dimensions: (time: 23725, lat: 600, lon: 1440)
Coordinates:
* time (time) object 190kB 1950-01-01 12:00:00 ... 2014-12-31 12:00:00
* lat (lat) float64 5kB -59.88 -59.62 -59.38 -59.12 ... 89.38 89.62 89.88
* lon (lon) float64 12kB 0.125 0.375 0.625 0.875 ... 359.4 359.6 359.9
Data variables:
tas (time, lat, lon) float32 82GB dask.array<chunksize=(1, 600, 1440), meta=np.ndarray>
Attributes: (12/22)
downscalingModel: BCSD
activity: NEX-GDDP-CMIP6
Conventions: CF-1.7
frequency: day
institution: NASA Earth Exchange, NASA Ames Research Center, Mo...
variant_label: r1i1p1f2
... ...
cmip6_institution_id: NASA-GISS
cmip6_license: CC-BY-SA 4.0
contact: Dr. Bridget Thrasher: bridget@climateanalyticsgrou...
creation_date: Sat Nov 16 21:42:29 PST 2024
disclaimer: These data are considered provisional and subject ...
tracking_id: ba64ebba-3c17-4348-b463-7e6ae7d11769
Compare the time it takes to lazily open the NetCDF files (~1 minute) with the time it took to lazily open the virtual Icechunk store (~400 ms).
Now let’s take a slice of the data:
%%time
data_slice = expected.tas.sel(time="2010", lat=slice(39, 41), lon=slice(284, 286))
data_slice.mean(dim="time").compute().plot()
CPU times: user 8.41 s, sys: 156 ms, total: 8.57 s
Wall time: 9.78 s
Interestingly, that takes about the same amount of time (7 s for the virtual store, 9 s for the native files). That makes sense, because the chunk data is being read from the same underlying NetCDF objects on S3.