Publishing a CMIP6 Kerchunk Reference to STAC
Run this notebook
You can launch this notebook in VEDA JupyterHub by clicking the link below.
Launch in VEDA JupyterHub (requires access)
Learn more
Inside the Hub
This notebook was written on a VEDA JupyterHub instance. See [VEDA Analytics JupyterHub Access](https://nasa-impact.github.io/veda-docs/veda-jh-access.html) for information about how to gain access.
Outside the Hub
You are welcome to run this notebook anywhere you like (for example on https://daskhub.veda.smce.nasa.gov/, MAAP, or locally); just make sure that the data is accessible, or get in contact with the VEDA team to enable access.
Approach
This notebook creates STAC collection metadata for a CMIP6 Kerchunk Reference File which has already been generated and stored in S3.
This notebook serves as documentation for the publication of the CMIP6 Kerchunk reference. It is not expected to generalize to arbitrary Zarr datasets, but it may be a helpful example. It was run on the VEDA JupyterHub, and since veda-data-store-staging is a protected bucket it is not expected to work in an environment without access to that bucket.
Step 1: Install and import necessary libraries

```python
#!pip install xstac
import pystac
import requests
import s3fs
import xstac
import fsspec
import xarray as xr
```
Step 2: Open the dataset with xarray
```python
dataset_url = 's3://veda-data-store-staging/cmip6-GISS-E2-1-G-tas-kerchunk/combined_CMIP6_daily_GISS-E2-1-G_tas_kerchunk.json'

xr_open_args = {
    "engine": "zarr",
    "decode_coords": "all",
    "consolidated": False
}

fs = fsspec.filesystem(
    "reference",
    fo=dataset_url,
    remote_options={"anon": True},
)
src_path = fs.get_mapper("")

ds = xr.open_dataset(src_path, **xr_open_args)
```
```
/tmp/ipykernel_5419/732403854.py:16: UserWarning: Variable(s) referenced in cell_measures not in variables: ['areacella']
  ds = xr.open_dataset(src_path, **xr_open_args)
```
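As a quick sanity check (an optional addition, not part of the original notebook), you can confirm the reference opened correctly; the dataset should expose the daily tas (near-surface air temperature) variable named in the reference file:

```python
# Optional sanity check: inspect the dataset opened from the Kerchunk reference
print(ds)                # summary of dimensions, coordinates, and variables
print(ds["tas"].dims)    # typically ('time', 'lat', 'lon') for CMIP6 daily tas
```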
Step 3: Generate STAC metadata
The spatial extent is taken from the xarray metadata. The temporal extent will be added by the xstac library.
```python
spatial_extent_values = [ds.lon[0].values, ds.lat[0].values, ds.lon[-1].values, ds.lat[-1].values]
spatial_extent = list(map(int, spatial_extent_values))
_id = 'combined_CMIP6_daily_GISS-E2-1-G_tas_kerchunk_TEST'
zarr_asset = pystac.Asset(
    title='zarr',
    href=dataset_url,
    media_type='application/vnd+zarr',
    roles=['data'],
)
extent = pystac.Extent(
    spatial=pystac.SpatialExtent(bboxes=[spatial_extent]),
    temporal=pystac.TemporalExtent([[None, None]])
)
```
Add the VEDA provider.
```python
providers = [
    pystac.Provider(
        name="VEDA",
        roles=[pystac.ProviderRole.PRODUCER, pystac.ProviderRole.PROCESSOR, pystac.ProviderRole.HOST],
        url="https://github.com/nasa-impact/veda-data-pipelines",
    )
]
```
Put it all together to initialize a pystac.Collection instance.
```python
collection = pystac.Collection(
    id=_id,
    extent=extent,
    assets={'zarr': zarr_asset},
    description='for zarr testing',
    providers=providers,
    stac_extensions=['https://stac-extensions.github.io/datacube/v2.0.0/schema.json'],
    license="CC0-1.0"
)
```
That collection instance is used by xstac to generate additional metadata, such as the temporal extent and the datacube extension information.
```python
collection_template = collection.to_dict()
collection = xstac.xarray_to_stac(
    ds,
    collection_template,
    temporal_dimension="time",
    x_dimension="lon",
    y_dimension="lat",
    # TODO: get this from attributes if possible
    reference_system="4326",
    validate=False
)

# It should validate, yay!
collection.validate()
```
```
['https://schemas.stacspec.org/v1.0.0/collection-spec/json-schema/collection.json',
 'https://stac-extensions.github.io/datacube/v2.0.0/schema.json']
```
Final Step - Publish the collection
Finally, we will publish the collection using the VEDA STAC Ingestor API. If you are trying to publish to the VEDA STAC API but don’t have credentials for the STAC ingestor, this is a good time to ask for help and take a break. If you are not trying to publish to the VEDA STAC API but you are using pgSTAC, you should be able to write the collection to a JSON file and load it into your static catalog or pgSTAC database using pypgstac, as sketched below.
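If you are taking the pgSTAC route instead, a minimal sketch might look like the following (the collection.json filename and the database DSN are placeholder assumptions, and pypgstac must be installed separately):

```python
import json

# Write the collection metadata to a local JSON file (filename is a placeholder)
with open("collection.json", "w") as f:
    json.dump(collection.to_dict(), f, indent=2)

# Then load it into a pgSTAC database, for example with the pypgstac CLI
# (the DSN below is a placeholder):
#   pypgstac load collections collection.json --dsn postgresql://user:pass@host:5432/postgis
```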
```python
# The VEDA STAC ingestor requires a few more fields
dataset = collection.to_dict()
dataset['data_type'] = 'zarr'
dataset['collection'] = _id
dataset['title'] = 'CMIP6 Daily GISS-E2-1-G TAS Kerchunk (DEMO)'
dataset['dashboard:is_periodic'] = True
dataset['dashboard:time_density'] = 'day'
```
```python
# You may need to install cognito client
from cognito_client import CognitoClient

STAC_INGESTOR_API = "https://6r8ht9b123.execute-api.us-west-2.amazonaws.com/dev/"
client = CognitoClient(
    client_id="CHANGE ME",
    user_pool_id="CHANGE ME",
    identity_pool_id="CHANGE ME",
)
_ = client.login()
TOKEN = client.access_token
```
= f"Bearer {TOKEN}"
auth_header = {
headers "Authorization": auth_header,
"content-type": "application/json",
"accept": "application/json",
}= requests.post((STAC_INGESTOR_API + "api/ingest/collections"), json=dataset, headers=headers)
response
print(response.text)
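Optionally (an addition beyond the original notebook), fail loudly if the ingestor rejected the request rather than relying on reading the printed response body:

```python
# Raise an HTTPError on a non-2xx response so a failed ingest is not missed
response.raise_for_status()
print(f"Ingest succeeded with status {response.status_code}")
```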