By James McClain, Azavea
Detection and removal of clouds in satellite imagery is a natural application of machine learning, but freely available training datasets have been in short supply. To address this, Azavea has produced a public dataset consisting of 32 unique Sentinel-2 tiles with human-produced cloud labels. The tiles are provided in both L1C (top-of-atmosphere) and L2A (surface reflectance) versions. They cover 25 unique locations, and all four seasons are represented in the dataset.
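The data can be found in a publicly accessible, requester-pays S3 bucket: s3://azavea-cloud-model/. Because the bucket is requester-pays, the AWS account making the requests is billed for them rather than the bucket owner. The total size of the dataset is about 80 GB.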
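As a rough illustration of what requester-pays access involves, the sketch below lists object keys in the bucket using boto3. It assumes boto3 is installed and AWS credentials are configured; the helper name `list_bucket_keys` is made up for this example and is not part of the dataset tooling. The key point is the `RequestPayer="requester"` argument, which acknowledges that your account pays for the request.

```python
def list_bucket_keys(s3_client, bucket="azavea-cloud-model", prefix=""):
    """Return all object keys under `prefix` in a requester-pays bucket.

    `s3_client` is expected to be a boto3 S3 client, e.g. boto3.client("s3").
    This helper is a hypothetical example, not part of the dataset tooling.
    """
    paginator = s3_client.get_paginator("list_objects_v2")
    keys = []
    # RequestPayer="requester" confirms that the caller's AWS account
    # is billed for the request; without it, access is denied.
    for page in paginator.paginate(
        Bucket=bucket, Prefix=prefix, RequestPayer="requester"
    ):
        for obj in page.get("Contents", []):
            keys.append(obj["Key"])
    return keys


# Example usage (needs AWS credentials and incurs request charges):
#
#   import boto3
#   s3 = boto3.client("s3")
#   for key in list_bucket_keys(s3):
#       print(key)
#
# Downloading one object works the same way:
#
#   s3.download_file(
#       "azavea-cloud-model", key, "local-copy.tif",
#       ExtraArgs={"RequestPayer": "requester"},
#   )
```

With the AWS CLI, the equivalent is to pass `--request-payer requester` to commands such as `aws s3 ls` or `aws s3 cp`.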
Every tile in the dataset has three associated files: a catalog.zip file (a STAC archive containing the vector labels), a *L1C-0.tif file (a GeoTIFF containing the L1C version of the tile), and a *L2A-0.tif file (a GeoTIFF containing the L2A version of the tile). The list of tiles in the dataset can be found on GitHub. That file gives the locations of the catalog.zip files and the *L1C-0.tif files; the locations of the L2A files can be deduced from the locations of the L1C files.
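One plausible way to deduce an L2A location from an L1C location is sketched below. It assumes that the two files sit side by side and differ only in the `L1C-0.tif` versus `L2A-0.tif` suffix, which matches the file name patterns given above but should be checked against the actual bucket contents. The function name and the example path are hypothetical.

```python
L1C_SUFFIX = "L1C-0.tif"
L2A_SUFFIX = "L2A-0.tif"


def l2a_uri_from_l1c(l1c_uri):
    """Derive the L2A GeoTIFF location from the L1C one.

    Assumption: the two files share everything except the trailing
    "L1C-0.tif" / "L2A-0.tif" part of the name.
    """
    if not l1c_uri.endswith(L1C_SUFFIX):
        raise ValueError(f"not an L1C tile location: {l1c_uri}")
    # Swap only the trailing suffix so that any "L1C" appearing
    # earlier in the path is left untouched.
    return l1c_uri[: -len(L1C_SUFFIX)] + L2A_SUFFIX


# Hypothetical path, for illustration only.
example_l1c = "s3://azavea-cloud-model/some-prefix/example-tile-L1C-0.tif"
example_l2a = l2a_uri_from_l1c(example_l1c)
```

Replacing only the trailing suffix, rather than every occurrence of `L1C`, avoids accidentally rewriting a directory name that happens to contain the same string.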