Azavea publishes public cloud dataset for satellite imagery machine learning applications

By James McClain, Azavea

Detecting and removing clouds in satellite imagery is a natural application of machine learning, but freely available training datasets have been in short supply. To address this, Azavea has produced a public dataset of 32 unique Sentinel-2 tiles with human-produced cloud labels. Each tile is provided in L1C (top of atmosphere) and L2A (surface reflectance) versions. The tiles cover 25 unique locations, and all four seasons are represented in the dataset.

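The data is available in a publicly accessible, requester-pays S3 bucket: s3://azavea-cloud-model/. Because the bucket is requester-pays, the AWS account that downloads the data is billed for the transfer. The total size of the dataset is about 80 GB.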
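As a rough illustration of what requester-pays access looks like in practice, here is a minimal Python sketch that lists the contents of the bucket. It assumes the third-party boto3 library and configured AWS credentials. The function names list_tile_files and main are hypothetical, chosen for this example, and nothing is assumed about the layout of keys inside the bucket.

```python
def list_tile_files(client, bucket="azavea-cloud-model", prefix=""):
    """Yield (key, size_in_bytes) for every object under `prefix`.

    `client` is expected to behave like a boto3 S3 client. The
    RequestPayer argument acknowledges that the caller's AWS account
    is billed for the request, which a requester-pays bucket requires.
    """
    paginator = client.get_paginator("list_objects_v2")
    pages = paginator.paginate(
        Bucket=bucket,
        Prefix=prefix,
        RequestPayer="requester",
    )
    for page in pages:
        # An empty result page has no "Contents" entry.
        for obj in page.get("Contents", []):
            yield obj["Key"], obj["Size"]


def main():
    # Imported here so the helper above can be used without boto3 installed.
    import boto3  # third-party; requires configured AWS credentials

    s3 = boto3.client("s3")
    for key, size in list_tile_files(s3):
        print(f"{size:>14,d}  {key}")
```

Calling main() prints one line per object with its size in bytes. Downloads need the same acknowledgement, for example by passing ExtraArgs={"RequestPayer": "requester"} to the client's download_file method.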

Three files are associated with every tile in the dataset: a STAC archive containing the vector labels, a *L1C-0.tif file (a GeoTIFF containing the L1C version of the tile), and a *L2A-0.tif file (a GeoTIFF containing the L2A version of the tile). The list of tiles in the dataset can be found on GitHub. That file contains the locations of the STAC archives and the *L1C-0.tif files; the locations of the L2A files can be deduced from the locations of the L1C files.
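One plausible way to deduce an L2A location from an L1C location is sketched below in Python. It assumes that the two files for a tile share the same prefix and differ only in the L1C-0.tif and L2A-0.tif suffixes named above. That is an assumption drawn from the file name patterns, not a documented rule. The function name l2a_location is hypothetical.

```python
L1C_SUFFIX = "L1C-0.tif"
L2A_SUFFIX = "L2A-0.tif"


def l2a_location(l1c_location: str) -> str:
    """Return the presumed L2A location for a given L1C location.

    Assumption: both versions of a tile sit side by side and differ
    only in their suffix. Anything not ending in the L1C suffix is
    rejected rather than guessed at.
    """
    if not l1c_location.endswith(L1C_SUFFIX):
        raise ValueError(f"not an L1C location: {l1c_location!r}")
    # Swap only the trailing suffix, leaving the rest of the path as-is.
    return l1c_location[: -len(L1C_SUFFIX)] + L2A_SUFFIX
```

Replacing only the trailing suffix, rather than every occurrence of "L1C" in the string, avoids accidentally rewriting a directory name that happens to contain the same text.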