Detecting cloud presence in satellite images using the RGB-based CLIP vision-language model

Czerkawski, Mikolaj and Atkinson, Robert and Tachtatzis, Christos (2023) Detecting cloud presence in satellite images using the RGB-based CLIP vision-language model. In: International Geoscience and Remote Sensing Symposium, 2023-07-16 - 2023-07-21.

Accepted Author Manuscript
License: Strathprints license 1.0


Text has come to play a prominent role in the processing of visual data in recent years, both for images [1, 2, 3, 4, 5, 6, 7] and for videos [8, 9, 10]. Language lets human users easily adapt computer vision tools to their needs, although so far it has been used primarily for creative purposes. Yet vision-language models could also pave the way for many remote sensing applications that can be defined in a zero-shot manner, with little or no training. At the core of many text-based vision solutions stands CLIP, a vision-language model designed to measure the alignment between text and image inputs [1]. This work investigates the capability of the CLIP model to recognize cloud-affected satellite images. The approach is not immediately obvious: the CLIP model operates on RGB images, whereas typical cloud detection methods for satellite imagery rely on bands beyond the visible RGB range, such as infrared, and are often sensor-specific. Some past works have explored the potential of an RGB-only cloud detection model [11], but the task is considered significantly more challenging. Furthermore, the CLIP model was trained on the general WebImageText dataset [1], so it is unclear how well it can perform on a task as specific as the classification of cloud-affected satellite imagery.
The capability of the official pre-trained CLIP model (ViT-B/32 backbone) is therefore put to the test. Two insights are gained: the results allow the utility of the representations learned by CLIP to be estimated for cloud-oriented tasks (which could lead to more complex uses such as segmentation or removal), and the model can act as a tool for filtering datasets based on the presence of clouds. The CLIP model [1] was designed for zero-shot classification of images, where the labels are specified as text and can therefore be supplied at inference time.
The CLIP model consists of separate encoders for text and image inputs, with a jointly learned embedding space. A relative measure of alignment between a given text-image pair can be obtained by computing the cosine similarity between the two encodings. The manuscript explores four variants of using CLIP for cloud presence detection, shown in Table 1: variant (1) is fully zero-shot and based on text prompts, while variants (2)-(4) rely on minor fine-tuning of the high-level classifier module (1,000 gradient steps with a batch size of 10, using only the training dataset). In (2), a linear probe is attached to the features produced by the image encoder. In (3), the CoOp approach described in [12] is employed. Finally, the Radar approach (4) applies a linear probe classifier to the image encodings of both the RGB data and a false-color composite of the SAR data (VV, VH, and the mean of the two channels are encoded as the 3 input channels). Furthermore, the learned approaches (2)-(4) are tested for dataset and sensor transferability: the (a) variants use training and testing data from the same sensor, while the (b) variants involve transfer. The text prompts for method (1) were arbitrarily selected as "This is a satellite image with clouds" and "This is a satellite image with clear sky", with no attempt to improve them.
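The zero-shot scoring of variant (1) and the SAR composite of variant (4) can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the image and text embeddings have already been produced by the CLIP encoders, and the function names, the toy two-dimensional embeddings, and the logit scale of 100 are hypothetical choices made for illustration only.

```python
import math

# The two prompts quoted in the text for the zero-shot variant (1).
PROMPTS = (
    "This is a satellite image with clouds",
    "This is a satellite image with clear sky",
)


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_probabilities(image_emb, text_embs, scale=100.0):
    """Softmax over scaled cosine similarities, one entry per prompt.

    `scale` is an assumed temperature, not a value taken from the manuscript.
    """
    logits = [scale * cosine_similarity(image_emb, t) for t in text_embs]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sar_false_color(vv, vh):
    """Per-pixel (VV, VH, mean of VV and VH) triplets for flat pixel lists."""
    return [(a, b, (a + b) / 2.0) for a, b in zip(vv, vh)]


if __name__ == "__main__":
    # Toy stand-ins for encoder outputs; real CLIP embeddings are much longer.
    image_emb = [1.0, 0.0]
    text_embs = [[0.9, 0.1], [0.1, 0.9]]
    probs = zero_shot_probabilities(image_emb, text_embs)
    for prompt, p in zip(PROMPTS, probs):
        print(f"{p:.3f}  {prompt}")
    print(sar_false_color([1.0, 2.0], [3.0, 4.0]))
```

In this sketch an image would be flagged as cloudy when the first probability exceeds the second. The learned variants (2)-(4) would instead train a small classifier on top of the image embeddings.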