Leveraging Contrastive Learning for Semantic Segmentation with Consistent Labels Across Varying Appearances

Feature alignment is common in domain adaptation for classification, but it is harder to apply to semantic segmentation. Segmentation requires dense labels, and alignment losses work best when different appearances share the same pixel-level ground truth. Real images rarely provide that structure, and generative methods often fail to preserve exact correspondence between image content and labels.

Simulation gives direct control over this constraint. Using CARLA, I generated a semantic segmentation dataset where each scene has multiple visual appearances while preserving the same ground-truth mask. This makes it possible to train feature consistency across appearances without pseudo-labeling.

In this work, the contrastive signal comes from comparing aligned appearances of the same scene. Since each appearance shares the same pixel-level ground truth, the model can learn feature consistency across visual conditions.
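A minimal sketch of what a cosine-similarity consistency term between two aligned appearances could look like. It is not the paper's implementation: the function name, the `(C, H, W)` feature layout, and the use of NumPy instead of a deep learning framework are assumptions made for illustration. Because both appearances share the same mask, the feature vectors at the same pixel location can be compared directly.

```python
import numpy as np

def cosine_consistency_loss(feat_a, feat_b, eps=1e-8):
    """Mean (1 - cosine similarity) over pixels.

    feat_a, feat_b: feature maps of shape (C, H, W) extracted from two
    appearances of the same scene (hypothetical layout, assumed here).
    """
    if feat_a.shape != feat_b.shape:
        raise ValueError("aligned appearances must yield same-shaped features")
    dot = np.sum(feat_a * feat_b, axis=0)            # (H, W) per-pixel dot product
    norm_a = np.linalg.norm(feat_a, axis=0)          # (H, W) per-pixel vector norms
    norm_b = np.linalg.norm(feat_b, axis=0)
    cos = dot / np.maximum(norm_a * norm_b, eps)     # guard against zero vectors
    return float(np.mean(1.0 - cos))

rng = np.random.default_rng(0)
feat_day = rng.normal(size=(8, 4, 4))     # stand-in for features of appearance 1
feat_scaled = 3.0 * feat_day              # same directions, different magnitude
feat_other = rng.normal(size=(8, 4, 4))   # unrelated features

loss_same = cosine_consistency_loss(feat_day, feat_scaled)
loss_diff = cosine_consistency_loss(feat_day, feat_other)
```

In this sketch the loss ignores feature magnitude and only penalizes differences in direction, so `loss_same` is numerically zero while `loss_diff` is not.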

Figure: three appearances of the same scene (Appearance 1, Appearance 2, Appearance 3) and the ground-truth mask they share.

I evaluated this setup with DAFormer in unsupervised domain adaptation (UDA) and domain generalization pipelines. The experiments compare standard training with all appearances against feature consistency losses computed across aligned appearances. In the Cityscapes results, all three consistency losses improve mIoU over the baseline, and cosine-similarity alignment performs best (62.8 vs. 58.3 mIoU).
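The results table also lists L2 and MMD as alignment metrics. The sketch below shows one common way to define each on aligned feature maps; the exact formulations used in the paper may differ. The function names, the RBF kernel, its bandwidth `sigma`, and the biased MMD estimator are all assumptions chosen for illustration.

```python
import numpy as np

def l2_alignment(feat_a, feat_b):
    """Mean squared difference between aligned feature maps of shape (C, H, W)."""
    return float(np.mean((feat_a - feat_b) ** 2))

def rbf_kernel(x, y, sigma):
    """RBF kernel matrix between rows of x (N, C) and rows of y (M, C)."""
    sq_dist = (
        np.sum(x ** 2, axis=1)[:, None]
        + np.sum(y ** 2, axis=1)[None, :]
        - 2.0 * x @ y.T
    )
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * sigma ** 2))

def mmd_alignment(feat_a, feat_b, sigma=1.0):
    """Biased estimate of squared MMD between two sets of per-pixel features."""
    x = feat_a.reshape(feat_a.shape[0], -1).T   # (H*W, C)
    y = feat_b.reshape(feat_b.shape[0], -1).T   # (H*W, C)
    return float(
        rbf_kernel(x, x, sigma).mean()
        + rbf_kernel(y, y, sigma).mean()
        - 2.0 * rbf_kernel(x, y, sigma).mean()
    )

rng = np.random.default_rng(0)
feat_a = rng.normal(size=(8, 4, 4))
feat_b = feat_a + 0.5                 # same scene, features shifted by a constant

l2_value = l2_alignment(feat_a, feat_b)
mmd_value = mmd_alignment(feat_a, feat_b)
mmd_identical = mmd_alignment(feat_a, feat_a)
```

As written, the L2 term compares features pixel by pixel, while this MMD term compares the two sets of pixel features as distributions and does not use the pixel pairing.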

Figure: pipeline example.

Although the dataset has less object diversity than GTA V (18 CARLA vehicle models versus more than 150 in GTA V), it matches or outperforms GTA V on every target dataset in the domain generalization experiments. This suggests that precise label consistency across appearances can compensate for lower asset diversity in some training setups.

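The table reports per-class performance for training on our dataset and evaluating on Cityscapes (Ours → Cityscapes) with different alignment metrics: L2 distance, maximum mean discrepancy (MMD), and cosine similarity (CS). Bold values indicate the best result for each class. Baseline refers to standard training with all appearances.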

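| Alignment | road | sidewalk | building | wall | fence | pole | traffic light | traffic sign | vegetation | sky | person | rider | car | truck | motorcycle | bicycle | mIoU | mAcc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Baseline | 91.5 | 55.9 | 89.1 | 43.4 | 30.1 | 50.1 | 48.4 | **59.8** | 88.3 | 92.7 | 38.8 | 18.3 | 88.0 | 25.7 | **49.9** | **61.9** | 58.3 | 71.2 |
| L2 | 93.6 | 62.7 | 88.8 | 45.3 | 24.5 | **50.8** | **51.3** | 56.1 | 88.1 | **93.1** | **69.9** | 44.3 | 84.3 | 19.6 | 41.7 | 58.1 | 60.8 | 72.4 |
| MMD | 90.1 | 53.5 | 88.2 | 41.9 | 30.5 | 49.0 | **51.3** | 55.9 | 87.8 | 92.4 | 69.8 | **44.6** | **88.4** | **28.0** | **49.9** | 57.6 | 61.2 | 71.6 |
| CS | **95.4** | **68.6** | **89.2** | **49.2** | **33.7** | 50.4 | 49.6 | 59.2 | **89.0** | 92.5 | 68.0 | 41.1 | 86.5 | 21.7 | 48.9 | 61.6 | **62.8** | **74.8** |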
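For reference, the mIoU and mAcc columns are conventionally derived from a class confusion matrix. The sketch below shows that standard computation on a toy example; the function names and the ignore label `255` (the usual Cityscapes convention) are assumptions and are not taken from the evaluation code behind the table.

```python
import numpy as np

def confusion_matrix(pred, target, num_classes, ignore_index=255):
    """Rows are ground-truth classes, columns are predicted classes."""
    valid = target != ignore_index                      # drop unlabeled pixels
    idx = num_classes * target[valid].astype(np.int64) + pred[valid].astype(np.int64)
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)

def segmentation_scores(conf):
    """Return per-class IoU, mean IoU, and mean per-class accuracy."""
    tp = np.diag(conf).astype(np.float64)
    gt_pixels = conf.sum(axis=1).astype(np.float64)     # pixels per ground-truth class
    pred_pixels = conf.sum(axis=0).astype(np.float64)   # pixels per predicted class
    union = gt_pixels + pred_pixels - tp
    with np.errstate(divide="ignore", invalid="ignore"):
        iou = tp / union        # NaN for classes absent from both pred and target
        acc = tp / gt_pixels    # NaN for classes absent from the ground truth
    return iou, float(np.nanmean(iou)), float(np.nanmean(acc))

# Toy 2x3 label maps with three classes.
target = np.array([[0, 0, 1], [1, 2, 2]])
pred = np.array([[0, 1, 1], [1, 2, 0]])

conf = confusion_matrix(pred, target, num_classes=3)
iou, miou, macc = segmentation_scores(conf)
```

On this toy input the per-class IoUs are 1/3, 2/3, and 1/2, giving an mIoU of 0.5 and an mAcc of 2/3. Benchmark evaluations typically accumulate one confusion matrix over the whole validation set before computing these scores, and report them as percentages.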

The dataset also improves domain generalization across target datasets. All methods were trained with the same hyperparameters and number of iterations. Bold values indicate the best result for each target dataset.

| Training dataset | Cityscapes | ACDC | Dark Zurich |
|---|---|---|---|
| GTA | 48.9 | **39.8** | 20.4 |
| Synthia | 40.5 | 31.8 | 17.8 |
| Ours | **51.0** | **39.8** | **24.9** |

The full paper is available on arXiv.