import xarray as xr
import hvplot.xarray
import cartopy.crs as ccrs

Downsample zarr
xarray and dask with data from SMAP
Run this notebook
You can launch this notebook in VEDA JupyterHub by clicking the link below.
Launch in VEDA JupyterHub (requires access)
Learn more
Inside the Hub
This notebook was written on the VEDA JupyterHub and is designed to be run on a JupyterHub associated with an AWS IAM role that has been granted permissions to the VEDA data store via its bucket policy. The instance used provided 16GB of RAM.
See [VEDA Analytics JupyterHub Access](https://nasa-impact.github.io/veda-docs/veda-jh-access.html) for information about how to gain access.
Outside the Hub
The data is in a protected bucket. Please request access by emailing aimee@developmentseed.org or alexandra@developmentseed.org, providing your affiliation, your interest in or expected use of the dataset, and an AWS IAM role or user Amazon Resource Name (ARN). The team will help you configure the Cognito client.
You should then run:
%run -i 'cognito_login.py'
Approach
This notebook demonstrates two strategies for downsampling data from a Zarr dataset so that it can be visualized within the memory limits of a notebook:
- Downsampling the temporal resolution of the data using xarray.DataArray.resample
- Coarsening the spatial resolution of the data using xarray.DataArray.coarsen
A strategy for visualizing any large amount of data is Datashader, which bins data into a fixed-size 2-D array. Passing rasterize=True in hvplot calls ensures the datashader library is used to bin the data. Optionally, an external Dask cluster is used to parallelize and distribute these large downsampling operations across compute nodes.
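Conceptually, rasterizing means aggregating many data points into a fixed-size 2-D grid, so the plot never handles more values than the grid has cells. Below is a rough NumPy sketch of that idea (a mean aggregation over bins), not datashader's actual implementation:

```python
import numpy as np

# 10,000 scattered points, each carrying a value (a toy stand-in for soil moisture)
rng = np.random.default_rng(0)
x, y = rng.uniform(0, 1, 10_000), rng.uniform(0, 1, 10_000)
values = x + y

# Bin into a fixed 50x50 grid, averaging the values that fall in each cell
bins = 50
sums, _, _ = np.histogram2d(x, y, bins=bins, weights=values)
counts, _, _ = np.histogram2d(x, y, bins=bins)
mean_grid = sums / np.where(counts == 0, 1, counts)

print(mean_grid.shape)  # (50, 50) -- the plot only ever sees this small array
```

However many input points there are, the renderer only ever draws the fixed grid, which is what keeps interactive plots of large datasets responsive.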
About the data
The SMAP mission is an orbiting observatory that measures the amount of water in the surface soil everywhere on Earth.
Load libraries
Optional: Create and Scale a Dask Cluster
We create a separate Dask cluster to speed up reprojecting the data (and any other parallelizable computations that might be required).
Note: if you skip this cell you will still be using Dask; you'll just be using only the machine where you are running this notebook.
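For a sense of what that local fallback looks like, here is a minimal sketch using Dask's default threaded scheduler on a synthetic chunked array, with no gateway or cluster involved:

```python
import dask.array as da

# A lazy array split into 16 chunks; nothing is computed at this point
arr = da.ones((4_000, 4_000), chunks=(1_000, 1_000))

# .compute() runs the task graph on local threads
total = arr.sum().compute(scheduler="threads")
print(total)  # 16000000.0
```

The same code runs unchanged against a remote cluster once a distributed client is active, which is why the rest of the notebook does not need to change depending on whether you created the GatewayCluster.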
from dask_gateway import GatewayCluster, Gateway
gateway = Gateway()
clusters = gateway.list_clusters()
# connect to an existing cluster - useful when the kernel shut down in the middle of an interactive session
if clusters:
cluster = gateway.connect(clusters[0].name)
else:
cluster = GatewayCluster(shutdown_on_close=True)
cluster.scale(16)
client = cluster.get_client()
client

Client-05313042-8cbb-11f0-816a-9ee98308dd31
Connection method: Cluster object | Cluster type: dask_gateway.GatewayCluster
GatewayCluster name: prod.ad9ae8551f5e47a48262103ee2e6acb9
Dashboard: /services/dask-gateway/clusters/prod.ad9ae8551f5e47a48262103ee2e6acb9/status
Open the dataset from S3
ds = xr.open_zarr("s3://veda-data-store-staging/EIS/zarr/SPL3SMP.zarr")
ds

<xarray.Dataset> Size: 68GB
Dimensions: (northing_m: 406, easting_m: 964,
datetime: 1679)
Coordinates:
* datetime (datetime) datetime64[ns] 13kB 2018-01-01 ...
* easting_m (easting_m) float64 8kB -1.735e+07 ... 1.7...
* northing_m (northing_m) float64 3kB 7.297e+06 ... -7....
Data variables: (12/26)
albedo (northing_m, easting_m, datetime) float32 3GB dask.array<chunksize=(100, 100, 100), meta=np.ndarray>
albedo_pm (northing_m, easting_m, datetime) float32 3GB dask.array<chunksize=(100, 100, 100), meta=np.ndarray>
bulk_density (northing_m, easting_m, datetime) float32 3GB dask.array<chunksize=(100, 100, 100), meta=np.ndarray>
bulk_density_pm (northing_m, easting_m, datetime) float32 3GB dask.array<chunksize=(100, 100, 100), meta=np.ndarray>
clay_fraction (northing_m, easting_m, datetime) float32 3GB dask.array<chunksize=(100, 100, 100), meta=np.ndarray>
clay_fraction_pm (northing_m, easting_m, datetime) float32 3GB dask.array<chunksize=(100, 100, 100), meta=np.ndarray>
... ...
static_water_body_fraction (northing_m, easting_m, datetime) float32 3GB dask.array<chunksize=(100, 100, 100), meta=np.ndarray>
static_water_body_fraction_pm (northing_m, easting_m, datetime) float32 3GB dask.array<chunksize=(100, 100, 100), meta=np.ndarray>
surface_flag (northing_m, easting_m, datetime) float32 3GB dask.array<chunksize=(100, 100, 100), meta=np.ndarray>
surface_flag_pm (northing_m, easting_m, datetime) float32 3GB dask.array<chunksize=(100, 100, 100), meta=np.ndarray>
surface_temperature (northing_m, easting_m, datetime) float32 3GB dask.array<chunksize=(100, 100, 100), meta=np.ndarray>
    surface_temperature_pm         (northing_m, easting_m, datetime) float32 3GB dask.array<chunksize=(100, 100, 100), meta=np.ndarray>

Select the variable of interest (soil moisture for this example).
soil_moisture = ds.soil_moisture
soil_moisture

<xarray.DataArray 'soil_moisture' (northing_m: 406, easting_m: 964,
datetime: 1679)> Size: 3GB
dask.array<open_dataset-soil_moisture, shape=(406, 964, 1679), dtype=float32, chunksize=(100, 100, 100), chunktype=numpy.ndarray>
Coordinates:
* datetime (datetime) datetime64[ns] 13kB 2018-01-01 ... 2022-09-09
* easting_m (easting_m) float64 8kB -1.735e+07 -1.731e+07 ... 1.735e+07
* northing_m (northing_m) float64 3kB 7.297e+06 7.26e+06 ... -7.297e+06
Attributes:
long_name: Representative DCA soil moisture measurement for the Earth ba...
units: cm**3/cm**3
valid_max: 0.5
    valid_min: 0.019999999552965164

Strategy 1: Downsample the temporal resolution of the data
To plot one day from every month, resample the data to 1 observation a month.
somo_one_month = soil_moisture.resample(datetime="1ME").nearest()

Notice that this took very little time because it only set up the calculation. We can call .compute() to tell Dask to trigger evaluation.
%%time
somo_one_month_evaluated = somo_one_month.compute()

CPU times: user 94.9 ms, sys: 64.3 ms, total: 159 ms
Wall time: 3.97 s
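The same lazy-setup-then-compute pattern can be reproduced on a small synthetic series (one year of daily values; "1ME" is the pandas month-end frequency alias used above, which requires a recent pandas):

```python
import numpy as np
import pandas as pd
import xarray as xr

# One year of daily values, chunked so operations stay lazy
times = pd.date_range("2018-01-01", "2018-12-31", freq="D")
series = xr.DataArray(
    np.arange(times.size, dtype="float32"),
    coords={"datetime": times},
    dims="datetime",
).chunk({"datetime": 30})

# Setting up the resample is nearly instant: it only builds the task graph
monthly = series.resample(datetime="1ME").nearest()

# compute() materializes one value per month-end label
monthly_values = monthly.compute()
print(monthly_values.sizes["datetime"])  # 12
```

Downsampling 365 daily values to 12 monthly ones is the same reduction the notebook applies to the full dataset, where it shrinks 1679 timesteps to 57.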
Quick plot
We can generate a quick plot using hvplot and datashader.
somo_one_month_evaluated.hvplot(
x="easting_m",
y="northing_m",
groupby="datetime",
crs=ccrs.epsg(6933), # this is a workaround for https://github.com/holoviz/hvplot/issues/1329
rasterize=True,
geo=True,
coastline=True,
aggregator="mean",
frame_height=400,
widget_location="bottom",
)

Reproject before plotting
Reproject the data for map visualization.
somo_one_month = somo_one_month.transpose("datetime", "northing_m", "easting_m")
somo_one_month = somo_one_month.rio.set_spatial_dims(
x_dim="easting_m", y_dim="northing_m"
)
somo_one_month = somo_one_month.rio.write_crs("EPSG:6933")
somo_reprojected = somo_one_month.rio.reproject("EPSG:4326")
somo_reprojected

<xarray.DataArray 'soil_moisture' (datetime: 57, y: 1046, x: 2214)> Size: 528MB
array([[[nan, nan, nan, ..., nan, nan, nan],
        ...,
        [nan, nan, nan, ..., nan, nan, nan]]],
      shape=(57, 1046, 2214), dtype=float32)
Coordinates:
* x (x) float64 18kB -179.9 -179.8 -179.6 ... 179.6 179.8 179.9
* y (y) float64 8kB 84.96 84.8 84.64 84.48 ... -84.64 -84.8 -84.96
* datetime (datetime) datetime64[ns] 456B 2018-01-31 ... 2022-09-30
spatial_ref int64 8B 0
Attributes:
long_name: Representative DCA soil moisture measurement for the Earth ba...
units: cm**3/cm**3
valid_max: 0.5
    valid_min: 0.019999999552965164

[!NOTE] This is now a fully materialized data array: when we reprojected, we triggered an implicit compute.
somo_reprojected.hvplot(
x="x",
y="y",
groupby="datetime",
coastline=True,
rasterize=True,
aggregator="mean",
frame_height=400,
widget_location="bottom",
)

Strategy 2: Coarsen the spatial resolution of the data
Below, we coarsen the spatial resolution of the data by a factor of 4 in x and 2 in y. These values were chosen because they work with the default boundary="exact" argument: each dimension's size is an exact multiple of its window.
You can also coarsen by datetime, using the same strategy as below but replacing easting_m and northing_m with datetime. If {"datetime": n} is passed to the dim argument, this computes the mean soil moisture over each window of n time steps.
Once the data has been coarsened, it is again reprojected for map visualization and then plotted.
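The boundary behaviour is easy to check on a small synthetic array: the default boundary="exact" requires each dimension's size to be an exact multiple of its window, while boundary="trim" drops the remainder:

```python
import numpy as np
import xarray as xr

arr = xr.DataArray(np.arange(12.0), dims="x")

# 12 is a multiple of 4, so the default boundary="exact" works: 12 -> 3 windows
exact = arr.coarsen(x=4).mean()
print(exact.values)  # [1.5 5.5 9.5]

# 12 is not a multiple of 5; "exact" would raise, "trim" drops the leftover 2 values
trimmed = arr.coarsen(x=5, boundary="trim").mean()
print(trimmed.sizes["x"])  # 2
```

This is why the factors 4 and 2 below were picked to divide the easting and northing dimension sizes evenly.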
coarsened = soil_moisture.coarsen(dim={"easting_m": 4, "northing_m": 2}).mean()
coarsened = coarsened.transpose("datetime", "northing_m", "easting_m")
coarsened = coarsened.rio.set_spatial_dims(x_dim="easting_m", y_dim="northing_m")
coarsened = coarsened.rio.write_crs("epsg:6933")
coarsened_reprojected = coarsened.rio.reproject("EPSG:4326")
coarsened_reprojected

<xarray.DataArray 'soil_moisture' (datetime: 1679, y: 315, x: 667)> Size: 1GB
array([[[nan, nan, nan, ..., nan, nan, nan],
        ...,
        [nan, nan, nan, ..., nan, nan, nan]]],
      shape=(1679, 315, 667), dtype=float32)
Coordinates:
* x (x) float64 5kB -179.7 -179.2 -178.7 ... 178.6 179.2 179.7
* y (y) float64 3kB 84.77 84.23 83.7 83.16 ... -83.64 -84.18 -84.72
* datetime (datetime) datetime64[ns] 13kB 2018-01-01 ... 2022-09-09
spatial_ref int64 8B 0
Attributes:
long_name: Representative DCA soil moisture measurement for the Earth b...
units: cm**3/cm**3
valid_max: 0.5
valid_min: 0.019999999552965164
    _FillValue: nan

coarsened_reprojected.hvplot(
x="x",
y="y",
groupby="datetime",
coastline=True,
rasterize=True,
aggregator="mean",
frame_height=400,
widget_location="bottom",
)

Cleanup
When using a remote Dask cluster, it is recommended to explicitly close the cluster.
client.shutdown()