Ingestion Workflow for Uploading Data to the VEDA Catalog for the VEDA Dashboard

A walkthrough of the ingestion workflow for data providers who want to add a new dataset to the VEDA Dashboard.

Author

Jonas Sølvsteen, Kathryn Berger

Published

July 25, 2023

Approach

This notebook is intended to be used as a reference for data providers who want to add new datasets to the VEDA Dashboard. Please read the documentation for Data Ingestion before moving forward with this notebook example.

For example purposes, we will walk you through adding the GEOGLAM June 2023 dataset directly to the VEDA Dashboard.

1. Validate the GeoTIFF
2. Upload the file to the staging S3 bucket (veda-data-store-staging)
3. Use the Ingest UI (ingest.openveda.cloud/) to generate STAC metadata for the file and add it to the staging STAC catalog (staging.openveda.cloud)

Once the data has been published to the STAC catalog under the geoglam collection, which is already configured for the dashboard, it will be available in the VEDA Dashboard.

1. Validate data format

Below we import some geospatial tools for validation and define the variables used in this notebook, including the TARGET_FILENAME for the data file you want to upload. In this example we demonstrate the ingestion of GEOGLAM’s June 2023 data. The file you want to upload (e.g., CropMonitor_2023_06_28.tif) must be located in the same repository folder as this notebook.

import os

import rio_cogeo
import rasterio
import boto3
import requests

In the cell below, we use YEAR and MONTH to build TARGET_FILENAME, the name the file will have once uploaded, so that it follows the naming format advised in the File preparation documentation. See example formats in the link provided.

If the LOCAL_FILE_PATH is already properly formatted, then both LOCAL_FILE_PATH and TARGET_FILENAME will be identical.

LOCAL_FILE_PATH = "CropMonitor_2023_06_28.tif"
YEAR, MONTH = 2023, 6

TARGET_FILENAME = f"CropMonitor_{YEAR}{MONTH:02}.tif"
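
If you would rather derive the year and month from the local file name than type them by hand, a small helper can do the renaming. The sketch below assumes the local name always follows the CropMonitor_YYYY_MM_DD.tif pattern seen in this example; target_filename_from_local is a hypothetical helper for illustration, not part of any VEDA tooling.

```python
import re

# Assumption: local files are named CropMonitor_YYYY_MM_DD.tif, as in this example.
LOCAL_NAME_PATTERN = re.compile(r"^CropMonitor_(\d{4})_(\d{2})_(\d{2})\.tif$")


def target_filename_from_local(local_name: str) -> str:
    """Hypothetical helper: map CropMonitor_YYYY_MM_DD.tif to CropMonitor_YYYYMM.tif."""
    match = LOCAL_NAME_PATTERN.match(local_name)
    if match is None:
        raise ValueError(f"Unexpected file name: {local_name!r}")
    year, month = int(match.group(1)), int(match.group(2))
    if not 1 <= month <= 12:
        raise ValueError(f"Month out of range in file name: {local_name!r}")
    return f"CropMonitor_{year}{month:02}.tif"


print(target_filename_from_local("CropMonitor_2023_06_28.tif"))
```

Failing loudly on an unexpected name is deliberate: a silently wrong month in the target name would place the file under the wrong date in the catalog.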

The following code tests whether the file you are planning to upload is a Cloud Optimized GeoTIFF (COG), a format that enables more efficient workflows in the cloud environment. If validation finds that the file is not a COG, the code converts it into one.

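# cog_validate returns a tuple: (is_valid, errors, warnings)
file_is_a_cog, errors, warnings = rio_cogeo.cog_validate(LOCAL_FILE_PATH)
if not file_is_a_cog:
    print(f"File is not a COG - converting ({errors})")
    # cog_translate needs an output profile; "deflate" is an assumed choice
    # among rio-cogeo's built-in profiles
    output_profile = rio_cogeo.cog_profiles.get("deflate")
    rio_cogeo.cog_translate(LOCAL_FILE_PATH, LOCAL_FILE_PATH, output_profile, in_memory=True)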

2. Upload file to S3

The following code will upload your COG data into the veda-data-store-staging bucket. It uses TARGET_FILENAME, which carries the year and month values provided earlier in this notebook, to name the object under the geoglam/ prefix of that bucket.

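s3 = boto3.client("s3")
BUCKET = "veda-data-store-staging"
# The object key is the path inside the bucket; it must not include the bucket name
KEY = f"geoglam/{TARGET_FILENAME}"
S3_FILE_LOCATION = f"s3://{BUCKET}/{KEY}"

# Guard against accidental uploads: change False to True when you are ready to upload
if False:
    # upload_file takes the local path, the bucket name, and the object key
    s3.upload_file(LOCAL_FILE_PATH, BUCKET, KEY)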
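
boto3's upload_file takes the bucket name and the object key as separate arguments, while an s3:// URI combines them, so the two are easy to mix up. The following is a minimal sketch, using only the standard library, of converting between the two forms; build_s3_uri and split_s3_uri are hypothetical helper names introduced here for illustration.

```python
from typing import Tuple
from urllib.parse import urlparse


def build_s3_uri(bucket: str, key: str) -> str:
    """Combine a bucket name and an object key into an s3:// URI."""
    return f"s3://{bucket}/{key.lstrip('/')}"


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split an s3:// URI into (bucket, key); the key has no leading slash."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


uri = build_s3_uri("veda-data-store-staging", "geoglam/CropMonitor_202306.tif")
bucket, key = split_s3_uri(uri)
print(uri)
print(bucket, key)
```

The bucket and key recovered this way are the values to pass to upload_file, and the URI is the form used in the sample_files entry of the dataset definition.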

3. Use the Ingest UI to add this geoglam item to the staging catalog

For this step, open the Ingest UI at ingest.openveda.cloud in a second browser tab and click the “Sign in with Keycloak” button to authenticate your session. You will be temporarily redirected to CILogon. To authorize, please use an Identity Provider that is associated with your primary work or institution email address. You will then be redirected back to the Ingest UI. The cells below will guide you through using the Ingest UI to stage your data.

3a. Construct dataset definition

Here, the data provider will construct the dataset definition (and supporting metadata) that will be used for dataset ingestion. It is imperative that these values are correct and align with the data the provider is planning to upload to the VEDA Platform. For example, make sure that the startdate and enddate are realistic (e.g., "enddate": "2023-06-31T23:59:59Z" would be an incorrect value, because June has only 30 days).
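
Date mistakes like this can be caught before anything is pasted into the Ingest UI. The sketch below uses only the Python standard library to compute the last second of a month and to reject impossible dates; it assumes the timestamps use the YYYY-MM-DDTHH:MM:SSZ format shown in this notebook, and parse_stac_datetime, month_end, and check_extent are hypothetical helper names.

```python
import calendar
from datetime import datetime

# Assumption: timestamps follow the format used in this notebook.
DATETIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def parse_stac_datetime(value: str) -> datetime:
    """Parse a timestamp; raises ValueError for impossible dates such as June 31."""
    return datetime.strptime(value, DATETIME_FORMAT)


def month_end(year: int, month: int) -> str:
    """Return the last second of the given month as a timestamp string."""
    last_day = calendar.monthrange(year, month)[1]
    return f"{year:04d}-{month:02d}-{last_day:02d}T23:59:59Z"


def check_extent(startdate: str, enddate: str) -> None:
    """Raise ValueError if either date is invalid or the range is reversed."""
    if parse_stac_datetime(startdate) > parse_stac_datetime(enddate):
        raise ValueError("startdate is later than enddate")


print(month_end(2023, 6))
check_extent("2020-01-01T00:00:00Z", month_end(2023, 6))

try:
    parse_stac_datetime("2023-06-31T23:59:59Z")
except ValueError as err:
    print(f"Rejected: {err}")
```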

For further detail on metadata required for entries in the VEDA STAC to work with the VEDA Dashboard, see documentation here. In particular, note recommendations for the fields dashboard:is_periodic and dashboard:time_density. For example, in the code block below we define the dashboard:is_periodic field as False because we are ingesting only one month of data. Even though we know that the monthly observations are provided routinely by GEOGLAM, we will only have a single file to ingest and so do not have a temporal range of items in the collection with a monthly time density to generate a time picker from the available data.

Note: Several OPTIONAL properties are added to this dataset config for completeness. Your dataset JSON does NOT need to include these optional properties:

* assets
* item_assets
* renders

import json

dataset = {
    "collection": "geoglam",
    "title": "GEOGLAM Crop Monitor",
    "data_type": "cog",
    "spatial_extent": {
        "xmin": -180,
        "ymin": -90,
        "xmax": 180,
        "ymax": 90
    },
    "temporal_extent": {
        "startdate": "2020-01-01T00:00:00Z",
        "enddate": "2023-06-30T23:59:59Z"
    },
    "license": "MIT",
    "description": "The Crop Monitors were designed to provide a public good of open, timely, science-driven information on crop conditions in support of market transparency for the G20 Agricultural Market Information System (AMIS). Reflecting an international, multi-source, consensus assessment of crop growing conditions, status, and agro-climatic factors likely to impact global production, focusing on the major producing and trading countries for the four primary crops monitored by AMIS (wheat, maize, rice, and soybeans). The Crop Monitor for AMIS brings together over 40 partners from national, regional (i.e. sub-continental), and global monitoring systems, space agencies, agriculture organizations and universities. Read more: https://cropmonitor.org/index.php/about/aboutus/",
    "dashboard:is_periodic": False,
    "dashboard:time_density": "month",
    ## NOTE: email the veda team at veda@uah.edu to upload a new thumbnail for your dataset
    "assets": {
        "thumbnail": {
            "href": "https://thumbnails.openveda.cloud/geoglam--dataset-cover.jpg",
            "type": "image/jpeg",
            "roles": ["thumbnail"],
            "title": "Thumbnail",
            "description": "Photo by [Jean Wimmerlin](https://unsplash.com/photos/RUj5b4YXaHE) (Bird's eye view of fields)"
        }
    },
    ## RENDERS metadata are OPTIONAL but provided below
    "renders": {
        "dashboard": {
            "bidx": [1],
            "title": "VEDA Dashboard Render Parameters",
            "assets": ["cog_default"],
            "unscale": False,
            "colormap": {
                "1": [120, 120, 120],
                "2": [130, 65, 0],
                "3": [66, 207, 56],
                "4": [245, 239, 0],
                "5": [241, 89, 32],
                "6": [168, 0, 0],
                "7": [0, 143, 201]
            },
            "max_size": 1024,
            "resampling": "nearest",
            "return_mask": True
        }
    },
    ## IMPORTANT: update providers for your data; some are specific to each collection
    "providers": [
        {
            "url": "https://data.nal.usda.gov/dataset/geoglam-geo-global-agricultural-monitoring-crop-assessment-tool#:~:text=The%20GEOGLAM%20crop%20calendars%20are,USDA%20FAS%2C%20and%20USDA%20NASS.",
            "name": "USDA & Global Crop Monitor Group partners",
            "roles": [
                "producer",
                "processor",
                "licensor"
            ]
        },
        {
            "url": "https://www.earthdata.nasa.gov/dashboard/",
            "name": "NASA VEDA",
            "roles": ["host"]
        }
    ],
    ## item_assets are OPTIONAL but pre-filled here
    "item_assets": {
        "cog_default": {
            "type": "image/tiff; application=geotiff; profile=cloud-optimized",
            "roles": ["data", "layer"],
            "title": "Default COG Layer",
            "description": "Cloud optimized default layer to display on map"
        }
    },
    "sample_files": [
        "s3://veda-data-store-staging/geoglam/CropMonitor_202306.tif"
    ],
    "discovery_items": [
        {
          "discovery": "s3",
          "prefix": "geoglam/",
          "bucket": "veda-data-store-staging",
          "filename_regex": "(.*)CropMonitor_202306.tif$"
        }
    ]
}

print(json.dumps(dataset, indent=2))

{
  "collection": "geoglam",
  "title": "GEOGLAM Crop Monitor",
  "data_type": "cog",
  "spatial_extent": {
    "xmin": -180,
    "ymin": -90,
    "xmax": 180,
    "ymax": 90
  },
  "temporal_extent": {
    "startdate": "2020-01-01T00:00:00Z",
    "enddate": "2023-06-30T23:59:59Z"
  },
  "license": "MIT",
  "description": "The Crop Monitors were designed to provide a public good of open, timely, science-driven information on crop conditions in support of market transparency for the G20 Agricultural Market Information System (AMIS). Reflecting an international, multi-source, consensus assessment of crop growing conditions, status, and agro-climatic factors likely to impact global production, focusing on the major producing and trading countries for the four primary crops monitored by AMIS (wheat, maize, rice, and soybeans). The Crop Monitor for AMIS brings together over 40 partners from national, regional (i.e. sub-continental), and global monitoring systems, space agencies, agriculture organizations and universities. Read more: https://cropmonitor.org/index.php/about/aboutus/",
  "dashboard:is_periodic": false,
  "dashboard:time_density": "month",
  "assets": {
    "thumbnail": {
      "href": "https://thumbnails.openveda.cloud/geoglam--dataset-cover.jpg",
      "type": "image/jpeg",
      "roles": [
        "thumbnail"
      ],
      "title": "Thumbnail",
      "description": "Photo by [Jean Wimmerlin](https://unsplash.com/photos/RUj5b4YXaHE) (Bird's eye view of fields)"
    }
  },
  "renders": {
    "dashboard": {
      "bidx": [
        1
      ],
      "title": "VEDA Dashboard Render Parameters",
      "assets": [
        "cog_default"
      ],
      "unscale": false,
      "colormap": {
        "1": [
          120,
          120,
          120
        ],
        "2": [
          130,
          65,
          0
        ],
        "3": [
          66,
          207,
          56
        ],
        "4": [
          245,
          239,
          0
        ],
        "5": [
          241,
          89,
          32
        ],
        "6": [
          168,
          0,
          0
        ],
        "7": [
          0,
          143,
          201
        ]
      },
      "max_size": 1024,
      "resampling": "nearest",
      "return_mask": true
    }
  },
  "providers": [
    {
      "url": "https://data.nal.usda.gov/dataset/geoglam-geo-global-agricultural-monitoring-crop-assessment-tool#:~:text=The%20GEOGLAM%20crop%20calendars%20are,USDA%20FAS%2C%20and%20USDA%20NASS.",
      "name": "USDA & Global Crop Monitor Group partners",
      "roles": [
        "producer",
        "processor",
        "licensor"
      ]
    },
    {
      "url": "https://www.earthdata.nasa.gov/dashboard/",
      "name": "NASA VEDA",
      "roles": [
        "host"
      ]
    }
  ],
  "item_assets": {
    "cog_default": {
      "type": "image/tiff; application=geotiff; profile=cloud-optimized",
      "roles": [
        "data",
        "layer"
      ],
      "title": "Default COG Layer",
      "description": "Cloud optimized default layer to display on map"
    }
  },
  "sample_files": [
    "s3://veda-data-store-staging/geoglam/CropMonitor_202306.tif"
  ],
  "discovery_items": [
    {
      "discovery": "s3",
      "prefix": "geoglam/",
      "bucket": "veda-data-store-staging",
      "filename_regex": "(.*)CropMonitor_202306.tif$"
    }
  ]
}
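
Before submitting, it can help to confirm that the filename_regex in discovery_items selects the file you uploaded and nothing else. The sketch below assumes the discovery step applies the pattern as a regular-expression search against object keys under the prefix; the actual matching behavior is defined by the ingest pipeline, and the second and third candidate keys are made up for illustration.

```python
import re

# Values copied from the discovery_items entry of the dataset definition.
discovery_item = {
    "prefix": "geoglam/",
    "bucket": "veda-data-store-staging",
    "filename_regex": "(.*)CropMonitor_202306.tif$",
}

# The first key is the file uploaded in this notebook; the others are hypothetical.
candidate_keys = [
    "geoglam/CropMonitor_202306.tif",
    "geoglam/CropMonitor_202305.tif",
    "geoglam/CropMonitor_202306.tif.aux.xml",
]

pattern = re.compile(discovery_item["filename_regex"])
matched = [
    key
    for key in candidate_keys
    if key.startswith(discovery_item["prefix"]) and pattern.search(key)
]
print(matched)
```

Only the June 2023 file is selected here. Note that the unescaped dot in the pattern matches any character, so a stricter pattern would escape it as `\.`.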

3b. Validate dataset definition

After composing your dataset definition, navigate to “Create Ingest” in the Ingest UI. There, you will see headers for Form and Manual JSON Edit. Navigate to the Manual JSON Edit page, copy the printed JSON, and paste it into the input field on that page.

If the JSON is valid, the response will confirm that it is ready to be published on the VEDA Platform. Otherwise, you will see a note in red at the bottom of the webpage that says Invalid JSON format.
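
A common cause of the Invalid JSON format message is pasting the Python representation of the dictionary rather than the output of json.dumps: Python writes False and single quotes, while JSON requires false and double quotes. The sketch below shows the difference on a two-field excerpt of the dataset definition; is_valid_json is a hypothetical helper, and it checks syntax only, not whatever further validation the Ingest UI may apply.

```python
import json

# Two fields taken from the dataset definition above.
snippet = {"dashboard:is_periodic": False, "dashboard:time_density": "month"}

as_json = json.dumps(snippet, indent=2)  # JSON text: false, double quotes
as_repr = str(snippet)                   # Python repr: False, single quotes


def is_valid_json(text: str) -> bool:
    """Return True if the text parses as JSON (syntax check only)."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


print(is_valid_json(as_json))
print(is_valid_json(as_repr))
```

The first check passes and the second fails, which is why the notebook prints the definition with json.dumps before it is copied.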

3c. Publish to STAC

Now that you have validated your dataset, you can initiate a workflow and publish the dataset to the VEDA Platform.

On the Form page of the Ingest UI, click Submit to submit your ingestion request.

On success, a veda-data GitHub Pull Request will be opened containing your ingest request, and a GitHub Actions workflow will be kicked off to publish your dataset to staging. A member of the VEDA Data Services team will review your PR to see if it is ready for production ingestion.

Congratulations! You have now successfully uploaded a COG dataset to the VEDA Dashboard. You can explore the data catalog to verify that the ingestion worked; the uploaded data should now be ready for viewing and exploration.