Ingestion Workflow for Uploading Data to the VEDA Catalog for the VEDA Dashboard

A walkthrough of the ingestion workflow for data providers who want to add a new dataset to the VEDA Dashboard.
Author

Jonas Sølvsteen, Kathryn Berger

Published

July 25, 2023

Approach

This notebook is intended as a reference for data providers who want to add new datasets to the VEDA Dashboard. Before working through this example, data providers should read the Data Ingestion documentation.

As an example, we will walk through adding the GEOGLAM June 2023 dataset to the VEDA Dashboard in three steps:

  1. Validate the GeoTIFF
  2. Upload the file to the staging S3 bucket (veda-data-store-staging)
  3. Use the workflows-api (staging.openveda.cloud/api/workflows/docs) to generate STAC metadata for the file and add it to the staging STAC catalog (staging.openveda.cloud)

Once the data has been published to the STAC metadata catalog under the geoglam collection, which is already configured for the dashboard, it will be available in the VEDA Dashboard.

1. Validate data format

Below we import some geospatial tools for validation and define the variables used in the rest of the notebook, including the TARGET_FILENAME for the data file you want to upload. In this example we demonstrate the ingestion of GEOGLAM's June 2023 data. The file you want to upload (e.g., CropMonitor_2023_06_28.tif) must be located in the same repository folder as this notebook.

import os

import rio_cogeo
import rasterio
import boto3
import requests

In the cell below we derive TARGET_FILENAME from the year and month of the data, so that the file uploaded to S3 follows the filename format advised in the File preparation documentation, regardless of what the local file is called. See example formats in the link provided.

If the name in LOCAL_FILE_PATH already follows that format, then LOCAL_FILE_PATH and TARGET_FILENAME will be identical.

LOCAL_FILE_PATH = "CropMonitor_2023_06_28.tif"
YEAR, MONTH = 2023, 6

TARGET_FILENAME = f"CropMonitor_{YEAR}{MONTH:02}.tif"
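The same naming rule can also be derived from a date object rather than from separate YEAR and MONTH variables. The sketch below is only an illustration: the helper name target_filename and its prefix parameter are hypothetical and not part of any VEDA tooling.

```python
from datetime import date


def target_filename(observation_date: date, prefix: str = "CropMonitor") -> str:
    """Build a filename of the form <prefix>_<YYYYMM>.tif, as used in this notebook."""
    # date objects accept strftime codes directly in an f-string format spec
    return f"{prefix}_{observation_date:%Y%m}.tif"


print(target_filename(date(2023, 6, 28)))  # CropMonitor_202306.tif
```

Working from a date keeps the zero padding of the month in one place, so a January file comes out as 202301 rather than 20231.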

The following code tests whether the file you are planning to upload is a Cloud Optimized GeoTIFF (COG), a format that enables more efficient workflows in the cloud environment. If validation finds that the file is not a COG, the code converts it into one.

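# cog_validate returns a tuple: (is_valid, errors, warnings)
file_is_a_cog, errors, warnings = rio_cogeo.cog_validate(LOCAL_FILE_PATH)
if not file_is_a_cog:
    print(f"File is not a COG - converting (validation errors: {errors})")
    # cog_translate requires an output profile; "deflate" is an assumption here,
    # choose the built-in rio-cogeo profile that suits your data
    output_profile = rio_cogeo.cog_profiles.get("deflate")
    rio_cogeo.cog_translate(LOCAL_FILE_PATH, LOCAL_FILE_PATH, output_profile, in_memory=True)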

2. Upload file to S3

The following code uploads your COG to the veda-data-store-staging bucket. The object is stored under the geoglam prefix in that bucket and named with TARGET_FILENAME, which carries the month and year values provided earlier in this notebook.

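s3 = boto3.client("s3")
BUCKET = "veda-data-store-staging"
# The object key is the path inside the bucket; it must not include the bucket name
KEY = f"geoglam/{TARGET_FILENAME}"
S3_FILE_LOCATION = f"s3://{BUCKET}/{KEY}"

# Guard against accidental uploads: change False to True when you are ready to upload
if False:
    s3.upload_file(LOCAL_FILE_PATH, BUCKET, KEY)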
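To make the relationship between bucket, key, and S3 URI explicit, here is a small sketch. The helper s3_location is hypothetical and exists only for illustration; the point it shows is that an S3 object key is the path inside the bucket and does not repeat the bucket name, while the s3:// URI is the bucket followed by the key.

```python
from typing import Tuple


def s3_location(bucket: str, prefix: str, filename: str) -> Tuple[str, str]:
    """Return (key, uri) for an object stored under a prefix in a bucket."""
    # The key is relative to the bucket, so the bucket name is not part of it
    key = f"{prefix.strip('/')}/{filename}"
    uri = f"s3://{bucket}/{key}"
    return key, uri


key, uri = s3_location("veda-data-store-staging", "geoglam/", "CropMonitor_202306.tif")
print(key)  # geoglam/CropMonitor_202306.tif
print(uri)  # s3://veda-data-store-staging/geoglam/CropMonitor_202306.tif
```

The URI printed here is the same value that appears under sample_files in the dataset definition later in this notebook.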

3. Use the workflows-api to add this geoglam item to the staging catalog

For this step, open the workflows API at staging.openveda.cloud/api/workflows/docs in a second browser tab and click the green authorize button at the upper right to authenticate your session with your username and password (you will be temporarily redirected to a login widget and then back to the workflows-api docs). The cells below guide you through configuring the request json for each endpoint demonstrated; you will copy the cell outputs into the workflows API in your second tab.

3a. Construct dataset definition

Here the data provider will construct the dataset definition (and supporting metadata) that will be used for dataset ingestion. It is imperative that these values are correct and align with the data the provider is planning to upload to the VEDA Platform. For example, make sure that the startdate and enddate are real dates (e.g., an "enddate":"2023-06-31T23:59:59Z" would be an incorrect value for June, as it contains only 30 days).
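A quick way to catch an impossible date before submitting is to parse both timestamps with the Python standard library. This is a minimal sketch, not part of the VEDA tooling: the function names are hypothetical, and the timestamp format is assumed from the example values used in this notebook.

```python
import calendar
from datetime import datetime

# Assumption: timestamps follow the pattern used in this notebook's examples
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def check_temporal_extent(startdate: str, enddate: str) -> None:
    """Raise ValueError if a timestamp is not a real date or the range is reversed."""
    # strptime rejects calendar dates that do not exist, such as June 31
    start = datetime.strptime(startdate, TIMESTAMP_FORMAT)
    end = datetime.strptime(enddate, TIMESTAMP_FORMAT)
    if start > end:
        raise ValueError(f"startdate {startdate} is after enddate {enddate}")


def last_second_of_month(year: int, month: int) -> str:
    """Return the final timestamp of a month, suitable for an enddate."""
    last_day = calendar.monthrange(year, month)[1]
    return f"{year:04d}-{month:02d}-{last_day:02d}T23:59:59Z"


check_temporal_extent("2020-01-01T00:00:00Z", "2023-06-30T23:59:59Z")  # passes silently
print(last_second_of_month(2023, 6))  # 2023-06-30T23:59:59Z
```

Calling check_temporal_extent with an enddate of "2023-06-31T23:59:59Z" raises a ValueError, because June has no 31st day.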

For further detail on metadata required for entries in the VEDA STAC to work with the VEDA Dashboard, see documentation here. In particular, note recommendations for the fields is_periodic and time_density. For example, in the code block below we define the is_periodic field as False because we are ingesting only one month of data. Even though we know that the monthly observations are provided routinely by GEOGLAM, we will only have a single file to ingest and so do not have a temporal range of items in the collection with a monthly time density to generate a time picker from the available data.

Note: Several OPTIONAL properties are added to this dataset config for completeness. Your dataset json does NOT need to include these optional properties: assets, item_assets, and renders.

import json

dataset = {
    "collection": "geoglam",
    "title": "GEOGLAM Crop Monitor",
    "data_type": "cog",
    "spatial_extent": {
        "xmin": -180,
        "ymin": -90,
        "xmax": 180,
        "ymax": 90
    },
    "temporal_extent": {
        "startdate": "2020-01-01T00:00:00Z",
        "enddate": "2023-06-30T23:59:59Z"
    },
    "license": "MIT",
    "description": "The Crop Monitors were designed to provide a public good of open, timely, science-driven information on crop conditions in support of market transparency for the G20 Agricultural Market Information System (AMIS). Reflecting an international, multi-source, consensus assessment of crop growing conditions, status, and agro-climatic factors likely to impact global production, focusing on the major producing and trading countries for the four primary crops monitored by AMIS (wheat, maize, rice, and soybeans). The Crop Monitor for AMIS brings together over 40 partners from national, regional (i.e. sub-continental), and global monitoring systems, space agencies, agriculture organizations and universities. Read more: https://cropmonitor.org/index.php/about/aboutus/",
    "is_periodic": False,
    "time_density": "month",
    ## NOTE: email the veda team at veda@uah.edu to upload a new thumbnail for your dataset
    "assets": {
        "thumbnail": {
            "href": "https://thumbnails.openveda.cloud/geoglam--dataset-cover.jpg",
            "type": "image/jpeg",
            "roles": ["thumbnail"],
            "title": "Thumbnail",
            "description": "Photo by [Jean Wimmerlin](https://unsplash.com/photos/RUj5b4YXaHE) (Bird's eye view of fields)"
        }
    },
    ## RENDERS metadata are OPTIONAL but provided below
    "renders": {
        "dashboard": {
            "bidx": [1],
            "title": "VEDA Dashboard Render Parameters",
            "assets": ["cog_default"],
            "unscale": False,
            "colormap": {
                "1": [120, 120, 120],
                "2": [130, 65, 0],
                "3": [66, 207, 56],
                "4": [245, 239, 0],
                "5": [241, 89, 32],
                "6": [168, 0, 0],
                "7": [0, 143, 201]
            },
            "max_size": 1024,
            "resampling": "nearest",
            "return_mask": True
        }
    },
    ## IMPORTANT: update providers for your data; some are specific to each collection
    "providers": [
        {
            "url": "https://data.nal.usda.gov/dataset/geoglam-geo-global-agricultural-monitoring-crop-assessment-tool#:~:text=The%20GEOGLAM%20crop%20calendars%20are,USDA%20FAS%2C%20and%20USDA%20NASS.",
            "name": "USDA & Global Crop Monitor Group partners",
            "roles": [
                "producer",
                "processor",
                "licensor"
            ]
        },
        {
            "url": "https://www.earthdata.nasa.gov/dashboard/",
            "name": "NASA VEDA",
            "roles": ["host"]
        }
    ],
    ## item_assets are OPTIONAL but pre-filled here
    "item_assets": {
        "cog_default": {
            "type": "image/tiff; application=geotiff; profile=cloud-optimized",
            "roles": ["data", "layer"],
            "title": "Default COG Layer",
            "description": "Cloud optimized default layer to display on map"
        }
    },
    "sample_files": [
        "s3://veda-data-store-staging/geoglam/CropMonitor_202306.tif"
    ],
    "discovery_items": [
        {
          "discovery": "s3",
          "prefix": "geoglam/",
          "bucket": "veda-data-store-staging",
          "filename_regex": "(.*)CropMonitor_202306.tif$"
        }
    ]
}

print(json.dumps(dataset, indent=2))
{
  "collection": "geoglam",
  "title": "GEOGLAM Crop Monitor",
  "data_type": "cog",
  "spatial_extent": {
    "xmin": -180,
    "ymin": -90,
    "xmax": 180,
    "ymax": 90
  },
  "temporal_extent": {
    "startdate": "2020-01-01T00:00:00Z",
    "enddate": "2023-06-30T23:59:59Z"
  },
  "license": "MIT",
  "description": "The Crop Monitors were designed to provide a public good of open, timely, science-driven information on crop conditions in support of market transparency for the G20 Agricultural Market Information System (AMIS). Reflecting an international, multi-source, consensus assessment of crop growing conditions, status, and agro-climatic factors likely to impact global production, focusing on the major producing and trading countries for the four primary crops monitored by AMIS (wheat, maize, rice, and soybeans). The Crop Monitor for AMIS brings together over 40 partners from national, regional (i.e. sub-continental), and global monitoring systems, space agencies, agriculture organizations and universities. Read more: https://cropmonitor.org/index.php/about/aboutus/",
  "is_periodic": false,
  "time_density": "month",
  "assets": {
    "thumbnail": {
      "href": "https://thumbnails.openveda.cloud/geoglam--dataset-cover.jpg",
      "type": "image/jpeg",
      "roles": [
        "thumbnail"
      ],
      "title": "Thumbnail",
      "description": "Photo by [Jean Wimmerlin](https://unsplash.com/photos/RUj5b4YXaHE) (Bird's eye view of fields)"
    }
  },
  "renders": {
    "dashboard": {
      "bidx": [
        1
      ],
      "title": "VEDA Dashboard Render Parameters",
      "assets": [
        "cog_default"
      ],
      "unscale": false,
      "colormap": {
        "1": [
          120,
          120,
          120
        ],
        "2": [
          130,
          65,
          0
        ],
        "3": [
          66,
          207,
          56
        ],
        "4": [
          245,
          239,
          0
        ],
        "5": [
          241,
          89,
          32
        ],
        "6": [
          168,
          0,
          0
        ],
        "7": [
          0,
          143,
          201
        ]
      },
      "max_size": 1024,
      "resampling": "nearest",
      "return_mask": true
    }
  },
  "providers": [
    {
      "url": "https://data.nal.usda.gov/dataset/geoglam-geo-global-agricultural-monitoring-crop-assessment-tool#:~:text=The%20GEOGLAM%20crop%20calendars%20are,USDA%20FAS%2C%20and%20USDA%20NASS.",
      "name": "USDA & Global Crop Monitor Group partners",
      "roles": [
        "producer",
        "processor",
        "licensor"
      ]
    },
    {
      "url": "https://www.earthdata.nasa.gov/dashboard/",
      "name": "NASA VEDA",
      "roles": [
        "host"
      ]
    }
  ],
  "item_assets": {
    "cog_default": {
      "type": "image/tiff; application=geotiff; profile=cloud-optimized",
      "roles": [
        "data",
        "layer"
      ],
      "title": "Default COG Layer",
      "description": "Cloud optimized default layer to display on map"
    }
  },
  "sample_files": [
    "s3://veda-data-store-staging/geoglam/CropMonitor_202306.tif"
  ],
  "discovery_items": [
    {
      "discovery": "s3",
      "prefix": "geoglam/",
      "bucket": "veda-data-store-staging",
      "filename_regex": "(.*)CropMonitor_202306.tif$"
    }
  ]
}
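Before submitting, it can be worth checking that the filename_regex in discovery_items matches the file you uploaded and nothing else under the prefix. The sketch below tests the pattern against a few candidate keys; the extra keys are made up for illustration, and the use of re.match is an assumption, since the matching rules applied by the discovery step may differ.

```python
import re

filename_regex = "(.*)CropMonitor_202306.tif$"

# Hypothetical object keys under the geoglam/ prefix; only the first is the June 2023 file
candidate_keys = [
    "geoglam/CropMonitor_202306.tif",
    "geoglam/CropMonitor_202305.tif",
    "geoglam/CropMonitor_2023_06_28.tif",
]

matched = [key for key in candidate_keys if re.match(filename_regex, key)]
print(matched)  # ['geoglam/CropMonitor_202306.tif']
```

Note that the unescaped "." in the pattern matches any character; writing the ending as \.tif$ would be stricter, although it makes no difference for the keys tested here.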

3b. Validate dataset definition

After composing your dataset definition, copy the printed json and paste it into the /dataset/validate input in the workflows-api docs page in the second tab. Note that if you navigate away from this page you will need to click authorize again.

Choose POST dataset/validate in the Dataset section of the API docs at staging.openveda.cloud/api/workflows/docs. Click 'Try it out', paste your json into the Request body, and then click Execute.

If the json is valid, the response will confirm that it is ready to be published on the VEDA Platform.
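The same request could in principle be sent from Python instead of the docs page. The sketch below only assembles the arguments and sends nothing. Several things in it are assumptions rather than documented behavior: that the endpoint path sits directly under the workflows API root, that the API accepts a bearer token in the Authorization header, and that you have obtained such a token separately. The helper name build_dataset_request is hypothetical.

```python
import json

# Assumption: endpoints live directly under the workflows API root
WORKFLOWS_API = "https://staging.openveda.cloud/api/workflows"


def build_dataset_request(endpoint: str, dataset: dict, token: str) -> dict:
    """Assemble keyword arguments for requests.post(); nothing is sent here."""
    return {
        "url": f"{WORKFLOWS_API}/dataset/{endpoint}",
        "headers": {
            # Assumption: the API accepts a bearer token in this header
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        "data": json.dumps(dataset),
    }


request_kwargs = build_dataset_request("validate", {"collection": "geoglam"}, token="<your-token>")
print(request_kwargs["url"])  # https://staging.openveda.cloud/api/workflows/dataset/validate

# To send it with the requests library imported earlier:
# response = requests.post(**request_kwargs)
# response.raise_for_status()
```

Keeping the request construction separate from the call makes it easy to inspect exactly what would be submitted, and the same helper could be pointed at the publish endpoint by changing the first argument.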

3c. Publish to STAC

Now that you have validated your dataset, you can initiate a workflow and publish the dataset to the VEDA Platform.

Choose POST dataset/publish in the Dataset section of the API docs at staging.openveda.cloud/api/workflows/docs. Click 'Try it out', paste your json into the Request body, and then click Execute.

On success, you will receive a success message containing the id of your workflow, for example:

{"message":"Successfully published collection: geoglam. 1  workflows initiated.","workflows_ids":["db6a2097-3e4c-45a3-a772-0c11e6da8b44"]}
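If you want to keep a record of the workflow id, the response body can be parsed with the standard json module. This sketch uses the example response shown above verbatim; the field name workflows_ids is taken from that example, and other responses may be shaped differently.

```python
import json

# The example success response from the publish endpoint, copied from above
response_text = (
    '{"message":"Successfully published collection: geoglam. 1  workflows initiated.",'
    '"workflows_ids":["db6a2097-3e4c-45a3-a772-0c11e6da8b44"]}'
)

response = json.loads(response_text)
# Fall back to an empty list if the field is absent
workflow_ids = response.get("workflows_ids", [])
print(workflow_ids)  # ['db6a2097-3e4c-45a3-a772-0c11e6da8b44']
```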

Congratulations! You have now successfully uploaded a COG dataset to the VEDA Dashboard. You can explore the data catalog to verify that the ingestion worked; the uploaded data should now be ready for viewing and exploration.