!pip install xstac "xpystac>=0.4.0" --quiet
Publishing CMIP6 Virtual Zarr to STAC
Run this notebook
You can launch this notebook in VEDA JupyterHub by clicking the link below.
Launch in VEDA JupyterHub (requires access)
Learn more
Inside the Hub
This notebook was written on a VEDA JupyterHub instance using the default modified Pangeo image which contains all necessary libraries except those explicitly installed in the Setup Section.
See VEDA Analytics JupyterHub Access for information about how to gain access.
Outside the Hub
You are welcome to run this anywhere you like (for example on MAAP or locally); just make sure that you have a compatible environment and the data is accessible, or get in contact with the VEDA team to enable access.
Approach
This notebook creates STAC collection metadata for a CMIP6 virtual Icechunk store which has already been generated and stored in S3. It was run on the VEDA JupyterHub, and since veda-data-store-staging is a protected bucket, it is not expected to work in an environment without access to that bucket.
This notebook serves as documentation for the publication of the CMIP6 virtual Icechunk store. It demonstrates how to create a STAC Collection for a virtual Icechunk store using the following extensions:
- Datacube STAC Extension
- Storage STAC Extension
- Virtual Assets STAC Extension
- Version STAC Extension
NOTE: This notebook will not entirely generalize to arbitrary virtual Zarr datasets but may be a helpful example. In particular, if the bucket that stores the virtual Icechunk store or the data itself were not publicly accessible, you could also use the Authentication STAC Extension to specify how to access those buckets.
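For illustration only, a hypothetical scheme entry under that extension might look roughly like the sketch below. The field names ("auth:schemes", "auth:refs") and the "s3" scheme type come from the Authentication STAC Extension, but check its schema before relying on them; nothing here is used later in this notebook.

# Hypothetical sketch only: describe credentialed S3 access for a protected bucket
# and reference it from an asset. Verify field names against the extension schema.
auth_schemes = {
    "s3-protected": {
        "type": "s3",
        "description": "Requires AWS credentials with read access to the bucket",
    }
}
protected_asset_extra_fields = {"auth:refs": ["s3-protected"]}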
Step 1: Setup
Install and import necessary libraries. These are in addition to all the libraries already included in the default modified Pangeo image.
import icechunk
import json
import pystac
import rioxarray  # noqa: F401 -- registers the .rio accessor used for spatial bounds below
import xstac
import xarray as xr
Zarr can emit a lot of warnings about Numcodecs not being included in the Zarr version 3 specification yet – let’s suppress those.
import warnings

warnings.filterwarnings(
    "ignore",
    message="Numcodecs codecs are not in the Zarr version 3 specification*",
    category=UserWarning,
)
Next we’ll specify where the virtual Icechunk store lives on s3 and where the actual CMIP6 NetCDF files live on s3. These are the pieces of information you need when opening the virtual Icechunk store to allow access via xr.open_zarr, and they are the minimum that we need to capture at the STAC level.
= "GISS-E2-1-G"
model = "tas"
variable
= "veda-data-store-staging"
s3_bucket_virtual = f"cmip6-{model}-{variable}-historical-icechunk"
s3_prefix_virtual = "us-west-2"
s3_region_virtual = False
s3_anonymous_virtual
= "nex-gddp-cmip6"
s3_bucket_data = f"NEX-GDDP-CMIP6/{model}/historical/"
s3_prefix_data = "us-west-2"
s3_region_data = f"s3://{s3_bucket_data}/{s3_prefix_data}"
s3_path_data = True s3_anonymous_data
Step 2: Open the dataset with xarray
To open the dataset with xarray we need to open the Icechunk store with the proper access credentials.
storage = icechunk.s3_storage(
    bucket=s3_bucket_virtual,
    prefix=s3_prefix_virtual,
    region=s3_region_virtual,
    anonymous=s3_anonymous_virtual,
    from_env=not s3_anonymous_virtual,
)
config = icechunk.RepositoryConfig.default()

config.set_virtual_chunk_container(
    icechunk.VirtualChunkContainer(
        s3_path_data,
        icechunk.s3_store(region=s3_region_data)
    )
)

if s3_anonymous_data:
    virtual_credentials = icechunk.containers_credentials(
        {
            s3_path_data: icechunk.s3_anonymous_credentials()
        }
    )

repo = icechunk.Repository.open(
    storage=storage,
    config=config,
    authorize_virtual_chunk_access=virtual_credentials,
)
session = repo.readonly_session(branch="main")
Once the session has been created we can use the store instance directly with xr.open_zarr.
%%time
ds = xr.open_zarr(session.store)
ds
CPU times: user 2.4 s, sys: 208 ms, total: 2.61 s
Wall time: 2.69 s
<xarray.Dataset> Size: 82GB
Dimensions:  (time: 23725, lat: 600, lon: 1440)
Coordinates:
  * lat      (lat) float64 5kB -59.88 -59.62 -59.38 -59.12 ... 89.38 89.62 89.88
  * lon      (lon) float64 12kB 0.125 0.375 0.625 0.875 ... 359.4 359.6 359.9
  * time     (time) object 190kB 1950-01-01 12:00:00 ... 2014-12-31 12:00:00
Data variables:
    tas      (time, lat, lon) float32 82GB dask.array<chunksize=(1, 600, 1440), meta=np.ndarray>
Attributes: (12/22)
    downscalingModel:      BCSD
    activity:              NEX-GDDP-CMIP6
    Conventions:           CF-1.7
    frequency:             day
    institution:           NASA Earth Exchange, NASA Ames Research Center, Mo...
    variant_label:         r1i1p1f2
    ...                    ...
    cmip6_institution_id:  NASA-GISS
    cmip6_license:         CC-BY-SA 4.0
    contact:               Dr. Bridget Thrasher: bridget@climateanalyticsgrou...
    creation_date:         Sat Nov 16 21:42:29 PST 2024
    disclaimer:            These data are considered provisional and subject ...
    tracking_id:           ba64ebba-3c17-4348-b463-7e6ae7d11769
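Before generating metadata it can be worth confirming that the virtual chunk references actually resolve. This optional check (not part of the original workflow) reads a single time step through the virtual references and reduces it to a scalar; if the virtual chunk container or credentials were misconfigured, it would fail:

# Pull a handful of chunks from the underlying NetCDF files and compute a mean.
ds[variable].isel(time=0).mean().compute()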
Step 3: Generate STAC metadata
Now let’s see how xstac can help us extract the variables and represent them using the Datacube STAC Extension.
We’ll start with some of the hard-coded values that will need to be provided for any given dataset. As much as possible these should be lifted directly from the xr.Dataset attrs.
= "CMIP6_daily_GISS-E2-1-G_tas"
collection_id = f"Experimental virtual Icechunk store representing historical {model} data for {variable}"
description = [
providers
pystac.Provider(="NASA-GISS",
name=[pystac.ProviderRole.PRODUCER, pystac.ProviderRole.PROCESSOR, pystac.ProviderRole.LICENSOR],
roles="https://data.giss.nasa.gov/modelE/cmip6/"
url
),
pystac.Provider(="VEDA",
name=[pystac.ProviderRole.PROCESSOR, pystac.ProviderRole.HOST],
roles="https://github.com/nasa-impact/veda-data-pipelines",
url
) ]
I want to draw special attention to how we can use the Storage STAC Extension to capture whether or not the s3 bucket where the virtual Icechunk store lives and the bucket where the data live can be accessed anonymously.
storage_schemes = {
    f"aws-s3-{s3_bucket_virtual}": {
        "type": "aws-s3",
        "platform": "https://{bucket}.s3.{region}.amazonaws.com",
        "bucket": s3_bucket_virtual,
        "region": s3_region_virtual,
        "anonymous": s3_anonymous_virtual,
    },
    f"aws-s3-{s3_bucket_data}": {
        "type": "aws-s3",
        "platform": "https://{bucket}.s3.{region}.amazonaws.com",
        "bucket": s3_bucket_data,
        "region": s3_region_data,
        "anonymous": s3_anonymous_data,
    }
}
Create an asset representing the data bucket.
In the asset definition we will include a reference to the storage scheme that we defined above. That way we’ll know how to access the bucket.
= f"{collection_id}-data-bucket"
data_asset_key = pystac.Asset(
data_asset
s3_path_data,=f"{collection_id} Data Bucket",
title="application/x-netcdf",
media_type=["data"],
roles={
extra_fields"storage:refs": [f"aws-s3-{s3_bucket_data}"],
} )
Create an asset representing the virtual Icechunk store.
The main concern of the asset is how to access the data, so we will include a reference to the storage scheme and also connect this asset with the data asset using the Virtual STAC Extension. The Virtual STAC Extension makes it clear that this is a virtual representation of data stored elsewhere and points to the asset where that data is defined.
= f"{collection_id}@{session.snapshot_id}"
virtual_asset_key = pystac.Asset(
virtual_asset f"s3://{s3_bucket_virtual}/{s3_prefix_virtual}/",
=f"{collection_id} Virtual Zarr Store",
title="application/vnd.zarr+icechunk", # TBD discussion https://earthmover-community.slack.com/archives/C07NQCBSTB7/p1756918042834049
media_type=["data", "references", "virtual", "latest-version"],
roles={
extra_fields"version": session.snapshot_id,
"storage:refs": [f"aws-s3-{s3_bucket_virtual}"],
"vrt:hrefs": [
{"key": data_asset_key,
"href": f"https://staging.openveda.cloud/api/stac/collections/{collection_id}#/assets/{data_asset_key}"
},
],
} )
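To double-check that the storage and virtual extension fields ended up where we expect, we can dump the asset to a plain dictionary (an optional inspection step, not required for publishing):

# Inspect the serialized asset, including "storage:refs", "vrt:hrefs", and "version".
print(json.dumps(virtual_asset.to_dict(), indent=2))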
Now let’s configure some metadata that can be derived from the xr.Dataset itself. We’ll leave the time bounds off for now since those will be handled by xstac.
= ds.attrs["title"]
title = pystac.Extent(
extent =pystac.SpatialExtent(bboxes=[list(ds.rio.set_spatial_dims("lon", "lat").rio.bounds())]),
spatial=pystac.TemporalExtent([[None, None]])
temporal )
Now that we have all those values set, create a template pystac.Collection:
template = pystac.Collection(
    collection_id,
    description=description,
    extent=extent,
    extra_fields={"storage:schemes": storage_schemes},
    providers=providers,
    title=title,
    assets={virtual_asset_key: virtual_asset, data_asset_key: data_asset},
    stac_extensions=[
        "https://stac-extensions.github.io/storage/v2.0.0/schema.json",
        "https://stac-extensions.github.io/virtual-assets/v1.0.0/schema.json",
        "https://stac-extensions.github.io/version/v1.2.0/schema.json",
    ],
    license="CC0-1.0",
)
That template is used by xstac to generate additional metadata, such as the temporal extent and the Datacube STAC Extension information.
collection = xstac.xarray_to_stac(
    ds,
    template,
    temporal_dimension="time",
    x_dimension="lon",
    y_dimension="lat",
    # TODO: get this from attributes if possible
    reference_system="4326",
    validate=False
)
# It should validate, yay!
collection.validate()
['https://schemas.stacspec.org/v1.1.0/collection-spec/json-schema/collection.json',
'https://stac-extensions.github.io/storage/v2.0.0/schema.json',
'https://stac-extensions.github.io/virtual-assets/v1.0.0/schema.json',
'https://stac-extensions.github.io/version/v1.2.0/schema.json',
'https://stac-extensions.github.io/datacube/v2.2.0/schema.json']
The VEDA STAC ingestor requires a few more fields so we can add those to the collection:
collection.extra_fields['data_type'] = 'icechunk'
collection.extra_fields['dashboard:is_periodic'] = "frequency" in ds.attrs
collection.extra_fields['dashboard:time_density'] = ds.attrs["frequency"]
Write out the collection as json so you can inspect it and don’t need to generate it again:
with open("collection.json", "w") as f:
=2) json.dump(collection.to_dict(), f, indent
Step 4: Make sure the collection is usable
Now that we have defined a STAC collection and a collection-level asset, we can use those directly to open the virtual Icechunk store, taking any necessary credentials into consideration.
collection = pystac.Collection.from_file("collection.json")

# Get the latest version of the collection-level virtual asset
assets = collection.get_assets(role="latest-version")
asset = next(iter(assets.values()))

xr.open_dataset(asset)
<xarray.Dataset> Size: 82GB
Dimensions:  (time: 23725, lat: 600, lon: 1440)
Coordinates:
  * lon      (lon) float64 12kB 0.125 0.375 0.625 0.875 ... 359.4 359.6 359.9
  * lat      (lat) float64 5kB -59.88 -59.62 -59.38 -59.12 ... 89.38 89.62 89.88
  * time     (time) object 190kB 1950-01-01 12:00:00 ... 2014-12-31 12:00:00
Data variables:
    tas      (time, lat, lon) float32 82GB ...
Attributes: (12/22)
    downscalingModel:      BCSD
    activity:              NEX-GDDP-CMIP6
    Conventions:           CF-1.7
    frequency:             day
    institution:           NASA Earth Exchange, NASA Ames Research Center, Mo...
    variant_label:         r1i1p1f2
    ...                    ...
    cmip6_institution_id:  NASA-GISS
    cmip6_license:         CC-BY-SA 4.0
    contact:               Dr. Bridget Thrasher: bridget@climateanalyticsgrou...
    creation_date:         Sat Nov 16 21:42:29 PST 2024
    disclaimer:            These data are considered provisional and subject ...
    tracking_id:           ba64ebba-3c17-4348-b463-7e6ae7d11769
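Opening the pystac asset directly like this relies on xpystac's xarray integration. As a rough sketch only (reusing the config and virtual_credentials objects from Step 2, and assuming the storage scheme fields shown earlier), a client could also reconstruct the Icechunk session from the collection metadata by hand:

# Sketch: resolve the asset's storage scheme and open the store with icechunk directly.
ref = asset.extra_fields["storage:refs"][0]
scheme = collection.extra_fields["storage:schemes"][ref]
prefix = asset.href.replace(f"s3://{scheme['bucket']}/", "").rstrip("/")
storage = icechunk.s3_storage(
    bucket=scheme["bucket"],
    prefix=prefix,
    region=scheme["region"],
    anonymous=scheme["anonymous"],
    from_env=not scheme["anonymous"],
)
repo = icechunk.Repository.open(
    storage=storage,
    config=config,
    authorize_virtual_chunk_access=virtual_credentials,
)
xr.open_zarr(repo.readonly_session(branch="main").store)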
Final Step - Publish the collection
The last step is to publish the collection by opening a PR in veda-data or using the ingest UI. For the ingest UI, you will need to authenticate with Keycloak/CILogon: https://docs.openveda.cloud/user-guide/data-services/setup-keycloak.html