Publishing CMIP6 Virtual Zarr to STAC

Tutorial for data providers who want to catalog a virtual icechunk store in STAC
Author

Aimee Barciauskas, Julia Signell

Published

September 17, 2025

Run this notebook

You can launch this notebook in VEDA JupyterHub by clicking the link below.

Launch in VEDA JupyterHub (requires access)

Learn more

Inside the Hub

This notebook was written on a VEDA JupyterHub instance using the default modified Pangeo image which contains all necessary libraries except those explicitly installed in the Setup Section.

See VEDA Analytics JupyterHub Access for information about how to gain access.

Outside the Hub

You are welcome to run this anywhere you like (e.g. on MAAP, locally, …); just make sure that you have a compatible environment and that the data is accessible, or get in contact with the VEDA team to enable access.

Approach

This notebook creates STAC collection metadata for a CMIP6 virtual Icechunk store which has already been generated and stored in S3. It was run on the VEDA JupyterHub and since veda-data-store-staging is a protected bucket it is not expected to work in an environment without access to that bucket.

This notebook serves as documentation for the publication of the CMIP6 virtual Icechunk store. It demonstrates how to create a STAC Collection for a virtual Icechunk store using the Datacube, Storage, Virtual Assets, and Version STAC Extensions.

NOTE: This notebook will not entirely generalize to arbitrary virtual Zarr datasets, but it may be a helpful example. In particular, if the bucket that stores the virtual Icechunk store or the underlying data were not publicly accessible, you could also use the Authentication STAC Extension to specify how to access those buckets.
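
For a protected bucket, a minimal sketch of what those Authentication STAC Extension fields could look like is shown below. The scheme name "s3-protected" and its description are placeholders for illustration only and are not used later in this notebook.

# Hypothetical Authentication STAC Extension snippet for a non-public bucket.
# "auth:schemes" lives on the Collection; protected assets reference a scheme
# by name via "auth:refs".
auth_schemes = {
    "s3-protected": {
        "type": "s3",
        "description": "Read access requires AWS credentials for this bucket",
    }
}
protected_asset_extra_fields = {"auth:refs": ["s3-protected"]}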

Step 1: Setup

Install and import necessary libraries. These are in addition to all the libraries already included in the default modified Pangeo image.

!pip install xstac "xpystac>=0.4.0" --quiet
import icechunk
import json
import pystac
import rioxarray  # noqa: F401 -- registers the .rio accessor used in Step 3
import xstac

import xarray as xr

Zarr can emit a lot of warnings about Numcodecs codecs not yet being included in the Zarr version 3 specification, so let's suppress those.

import warnings

warnings.filterwarnings(
    "ignore",
    message="Numcodecs codecs are not in the Zarr version 3 specification*",
    category=UserWarning,
)

Next we'll specify where the virtual Icechunk store lives on S3 and where the actual CMIP6 NetCDF files live on S3.

These are the pieces of information you need when opening the virtual Icechunk store to allow access via xr.open_zarr, and the minimum we need to capture at the STAC level.

model = "GISS-E2-1-G"
variable = "tas"

s3_bucket_virtual = "veda-data-store-staging"
s3_prefix_virtual = f"cmip6-{model}-{variable}-historical-icechunk"
s3_region_virtual = "us-west-2"
s3_anonymous_virtual = False

s3_bucket_data = "nex-gddp-cmip6"
s3_prefix_data = f"NEX-GDDP-CMIP6/{model}/historical/"
s3_region_data = "us-west-2"
s3_path_data = f"s3://{s3_bucket_data}/{s3_prefix_data}"
s3_anonymous_data = True

Step 2: Open the dataset with xarray

To open the dataset with xarray we need to open the Icechunk store with the proper access credentials.

storage = icechunk.s3_storage(
    bucket=s3_bucket_virtual,
    prefix=s3_prefix_virtual,
    region=s3_region_virtual,
    anonymous=s3_anonymous_virtual,
    from_env=not s3_anonymous_virtual,
)
config = icechunk.RepositoryConfig.default()
config.set_virtual_chunk_container(
    icechunk.VirtualChunkContainer(
        s3_path_data,
        icechunk.s3_store(region=s3_region_data)
    )
)

if s3_anonymous_data:
    virtual_credentials = icechunk.containers_credentials(
        {
            s3_path_data: icechunk.s3_anonymous_credentials()
        }
    )
else:
    # For a non-public data bucket, supply real credentials here instead of
    # anonymous ones; None leaves virtual chunk access unauthorized.
    virtual_credentials = None

repo = icechunk.Repository.open(
    storage=storage,
    config=config,
    authorize_virtual_chunk_access=virtual_credentials,
)
session = repo.readonly_session(branch="main")

Once the session has been created we can use the store instance directly with xr.open_zarr.

%%time
ds = xr.open_zarr(session.store)
ds
CPU times: user 2.4 s, sys: 208 ms, total: 2.61 s
Wall time: 2.69 s
<xarray.Dataset> Size: 82GB
Dimensions:  (time: 23725, lat: 600, lon: 1440)
Coordinates:
  * lat      (lat) float64 5kB -59.88 -59.62 -59.38 -59.12 ... 89.38 89.62 89.88
  * lon      (lon) float64 12kB 0.125 0.375 0.625 0.875 ... 359.4 359.6 359.9
  * time     (time) object 190kB 1950-01-01 12:00:00 ... 2014-12-31 12:00:00
Data variables:
    tas      (time, lat, lon) float32 82GB dask.array<chunksize=(1, 600, 1440), meta=np.ndarray>
Attributes: (12/22)
    downscalingModel:      BCSD
    activity:              NEX-GDDP-CMIP6
    Conventions:           CF-1.7
    frequency:             day
    institution:           NASA Earth Exchange, NASA Ames Research Center, Mo...
    variant_label:         r1i1p1f2
    ...                    ...
    cmip6_institution_id:  NASA-GISS
    cmip6_license:         CC-BY-SA 4.0
    contact:               Dr. Bridget Thrasher: bridget@climateanalyticsgrou...
    creation_date:         Sat Nov 16 21:42:29 PST 2024
    disclaimer:            These data are considered provisional and subject ...
    tracking_id:           ba64ebba-3c17-4348-b463-7e6ae7d11769

Step 3: Generate STAC metadata

Now let’s see how xstac can help us extract the variables and represent them using the Datacube STAC Extension.

We’ll start with some of the hard-coded values that will need to be provided for any given dataset. As much as possible these should be lifted directly from the xr.Dataset attrs.

collection_id = "CMIP6_daily_GISS-E2-1-G_tas"
description = f"Experimental virtual Icechunk store representing historical {model} data for {variable}"
providers = [
    pystac.Provider(
        name="NASA-GISS",
        roles=[pystac.ProviderRole.PRODUCER, pystac.ProviderRole.PROCESSOR, pystac.ProviderRole.LICENSOR],
        url="https://data.giss.nasa.gov/modelE/cmip6/"
    ),
    pystac.Provider(
        name="VEDA",
        roles=[pystac.ProviderRole.PROCESSOR, pystac.ProviderRole.HOST],
        url="https://github.com/nasa-impact/veda-data-pipelines",
    )
]

I want to draw special attention to how we can use the Storage STAC Extension to capture whether the S3 bucket where the virtual Icechunk store lives and the bucket where the data live can be accessed anonymously.

storage_schemes = {
    f"aws-s3-{s3_bucket_virtual}": {
        "type": "aws-s3",
        "platform": "https://{bucket}.s3.{region}.amazonaws.com",
        "bucket": s3_bucket_virtual,
        "region": s3_region_virtual,
        "anonymous": s3_anonymous_virtual,
    },
    f"aws-s3-{s3_bucket_data}": {
        "type": "aws-s3",
        "platform": "https://{bucket}.s3.{region}.amazonaws.com",
        "bucket": s3_bucket_data,
        "region": s3_region_data,
        "anonymous": s3_anonymous_data,
    }
}

Create an asset representing the data bucket.

In the asset definition we will include a reference to the storage scheme that we defined above. That way we’ll know how to access the bucket.

data_asset_key = f"{collection_id}-data-bucket"
data_asset = pystac.Asset(
    s3_path_data,
    title=f"{collection_id} Data Bucket",
    media_type="application/x-netcdf",
    roles=["data"],
    extra_fields={
        "storage:refs": [f"aws-s3-{s3_bucket_data}"],
    }
)

Create an asset representing the virtual Icechunk store.

The main concern of the asset is how to access the data, so we will include a reference to the storage scheme and also connect this asset to the data asset using the Virtual STAC Extension. The Virtual STAC Extension makes it clear that this is a virtual representation of data that is stored elsewhere and points to the asset where that data is defined.

virtual_asset_key = f"{collection_id}@{session.snapshot_id}"
virtual_asset = pystac.Asset(
    f"s3://{s3_bucket_virtual}/{s3_prefix_virtual}/",
    title=f"{collection_id} Virtual Zarr Store",
    media_type="application/vnd.zarr+icechunk",  # TBD discussion https://earthmover-community.slack.com/archives/C07NQCBSTB7/p1756918042834049
    roles=["data", "references", "virtual", "latest-version"],
    extra_fields={
        "version": session.snapshot_id,
        "storage:refs": [f"aws-s3-{s3_bucket_virtual}"],
        "vrt:hrefs": [
            {
                "key": data_asset_key,
                "href": f"https://staging.openveda.cloud/api/stac/collections/{collection_id}#/assets/{data_asset_key}"
            },
        ],
    }
)

Now let's configure some metadata that can be derived from the xr.Dataset itself. We'll leave the time bounds off for now since those will be handled by xstac.

title = ds.attrs["title"]
extent = pystac.Extent(
    spatial=pystac.SpatialExtent(bboxes=[list(ds.rio.set_spatial_dims("lon", "lat").rio.bounds())]),
    temporal=pystac.TemporalExtent([[None, None]])
)

Now that we have all those values set, create a template pystac.Collection:

template = pystac.Collection(
    collection_id,
    description=description,
    extent=extent,
    extra_fields={"storage:schemes": storage_schemes},
    providers=providers,
    title=title,
    assets={virtual_asset_key: virtual_asset, data_asset_key: data_asset},
    stac_extensions=[
        "https://stac-extensions.github.io/storage/v2.0.0/schema.json",
        "https://stac-extensions.github.io/virtual-assets/v1.0.0/schema.json",
        "https://stac-extensions.github.io/version/v1.2.0/schema.json",
    ],
    license="CC0-1.0",
)

That template is used by xstac to generate additional metadata, such as the temporal extent and the Datacube STAC Extension information.

collection = xstac.xarray_to_stac(
    ds,
    template,
    temporal_dimension="time",
    x_dimension="lon",
    y_dimension="lat",
    # TODO: get this from attributes if possible
    reference_system="4326",
    validate=False
)
# It should validate, yay!
collection.validate()
['https://schemas.stacspec.org/v1.1.0/collection-spec/json-schema/collection.json',
 'https://stac-extensions.github.io/storage/v2.0.0/schema.json',
 'https://stac-extensions.github.io/virtual-assets/v1.0.0/schema.json',
 'https://stac-extensions.github.io/version/v1.2.0/schema.json',
 'https://stac-extensions.github.io/datacube/v2.2.0/schema.json']
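
If you want to see what xstac added before moving on, you can inspect the generated Datacube fields and the derived temporal extent as a quick sanity check; the keys follow the Datacube STAC Extension.

# Peek at the metadata xstac generated from the dataset.
print(json.dumps(collection.extra_fields.get("cube:dimensions", {}), indent=2, default=str))
print(collection.extent.temporal.intervals)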

The VEDA STAC ingestor requires a few more fields so we can add those to the collection:

collection.extra_fields['data_type'] = 'icechunk'
collection.extra_fields['dashboard:is_periodic'] = "frequency" in ds.attrs
collection.extra_fields['dashboard:time_density'] = ds.attrs["frequency"]

Write out the collection as JSON so you can inspect it and don't need to generate it again:

with open("collection.json", "w") as f:
    json.dump(collection.to_dict(), f, indent=2)

Step 4: Make sure the collection is usable

Now that we have defined a STAC collection and a collection-level asset, we can use those directly to open the virtual Icechunk store, taking any necessary credentials into consideration.

collection = pystac.Collection.from_file("collection.json")

# Get the latest version of the collection-level virtual asset
assets = collection.get_assets(role="latest-version")
asset = next(iter(assets.values()))

xr.open_dataset(asset)
<xarray.Dataset> Size: 82GB
Dimensions:  (time: 23725, lat: 600, lon: 1440)
Coordinates:
  * lon      (lon) float64 12kB 0.125 0.375 0.625 0.875 ... 359.4 359.6 359.9
  * lat      (lat) float64 5kB -59.88 -59.62 -59.38 -59.12 ... 89.38 89.62 89.88
  * time     (time) object 190kB 1950-01-01 12:00:00 ... 2014-12-31 12:00:00
Data variables:
    tas      (time, lat, lon) float32 82GB ...
Attributes: (12/22)
    downscalingModel:      BCSD
    activity:              NEX-GDDP-CMIP6
    Conventions:           CF-1.7
    frequency:             day
    institution:           NASA Earth Exchange, NASA Ames Research Center, Mo...
    variant_label:         r1i1p1f2
    ...                    ...
    cmip6_institution_id:  NASA-GISS
    cmip6_license:         CC-BY-SA 4.0
    contact:               Dr. Bridget Thrasher: bridget@climateanalyticsgrou...
    creation_date:         Sat Nov 16 21:42:29 PST 2024
    disclaimer:            These data are considered provisional and subject ...
    tracking_id:           ba64ebba-3c17-4348-b463-7e6ae7d11769

Final Step: Publish the collection

The last step is to publish the collection by opening a PR in veda-data or using the ingest UI. For the ingest UI, you will need to authenticate with Keycloak/CILogon: https://docs.openveda.cloud/user-guide/data-services/setup-keycloak.html
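
If you would rather script the submission than use the ingest UI, a rough sketch could look like the following; the endpoint URL and token below are placeholders, so check the VEDA ingest documentation for the real values.

import requests

# Placeholder endpoint and token -- obtain the real ones from the VEDA ingest
# documentation and a Keycloak/CILogon login; these are not real values.
INGEST_API = "https://<veda-ingest-api>/collections"
TOKEN = "<access token from Keycloak/CILogon>"

with open("collection.json") as f:
    collection_dict = json.load(f)

resp = requests.post(
    INGEST_API,
    json=collection_dict,
    headers={"Authorization": f"Bearer {TOKEN}"},
)
resp.raise_for_status()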