Leveraging Contrastive Learning for Semantic Segmentation with Consistent Labels Across Varying Appearances
Feature alignment is common in domain adaptation for classification, but it is harder to apply to semantic segmentation. Segmentation requires dense labels, and alignment losses work best when different appearances share the same pixel-level ground truth. Real images rarely provide that structure, and generative methods often fail to preserve exact correspondence between image content and labels.
Simulation gives direct control over this constraint. Using CARLA, I generated a semantic segmentation dataset in which each scene is rendered under multiple visual appearances that all share the same ground-truth mask. This makes it possible to train feature consistency across appearances without pseudo-labeling.
In this work, the contrastive signal comes from comparing pixel-aligned appearances of the same scene. Because every appearance of a scene shares one pixel-level ground truth, features at the same image location can be compared directly across visual conditions.
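One way to picture this layout is a scene record that holds several rendered appearances and a single shared mask. The sketch below is illustrative only: the `Scene` class, the `appearance_pairs` helper, the appearance names, and the file paths are all hypothetical and are not taken from the dataset's actual tooling.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Dict, Iterator, Tuple


@dataclass(frozen=True)
class Scene:
    """One simulated scene: several renderings, one ground-truth mask."""
    scene_id: str
    appearances: Dict[str, str]  # appearance name -> image path (hypothetical names)
    mask_path: str               # single mask shared by every appearance


def appearance_pairs(scene: Scene) -> Iterator[Tuple[str, str, str]]:
    """Yield (image_a, image_b, mask) for every unordered pair of appearances.

    Both images in a pair are pixel-aligned, so the same mask supervises both
    and their features can be compared location by location.
    """
    ordered = sorted(scene.appearances.items())
    for (_, img_a), (_, img_b) in combinations(ordered, 2):
        yield img_a, img_b, scene.mask_path


# Hypothetical example scene with three appearances.
scene = Scene(
    scene_id="town_0001",
    appearances={
        "clear_noon": "town_0001/clear_noon.png",
        "rain_dusk": "town_0001/rain_dusk.png",
        "fog_morning": "town_0001/fog_morning.png",
    },
    mask_path="town_0001/labels.png",
)
pairs = list(appearance_pairs(scene))
```

With three appearances this yields three aligned pairs, each pointing at the same mask file, so no pseudo-labels are needed for any of them.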
I evaluated this setup with DAFormer in UDA and domain generalization pipelines. The experiments compare standard training on all appearances against training with feature consistency losses computed across aligned appearances. Every alignment loss improves mIoU over the baseline for Ours to Cityscapes, and cosine-similarity alignment gives the largest gain: mIoU rises from 58.3 to 62.8 and mAcc from 71.2 to 74.8.
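To make the idea of a feature consistency loss concrete, here is a minimal sketch of three common alignment measures between the features of two aligned appearances: per-pixel L2 distance, per-pixel cosine distance, and maximum mean discrepancy. It uses plain Python lists for readability, where a real pipeline would use tensor operations. The function names are hypothetical, the MMD is shown with a linear kernel as a simplifying assumption, and the exact formulation used in the paper (feature layer, normalization, loss weight) may differ.

```python
import math
from typing import Sequence

# N pixels x C channels, i.e. an H x W feature map flattened over space.
Features = Sequence[Sequence[float]]


def l2_alignment(fa: Features, fb: Features) -> float:
    """Mean squared Euclidean distance between features at the same pixel."""
    total = 0.0
    for va, vb in zip(fa, fb):
        total += sum((a - b) ** 2 for a, b in zip(va, vb))
    return total / len(fa)


def cosine_alignment(fa: Features, fb: Features, eps: float = 1e-8) -> float:
    """Mean (1 - cosine similarity) between features at the same pixel."""
    total = 0.0
    for va, vb in zip(fa, fb):
        dot = sum(a * b for a, b in zip(va, vb))
        norm_a = math.sqrt(sum(a * a for a in va))
        norm_b = math.sqrt(sum(b * b for b in vb))
        total += 1.0 - dot / max(norm_a * norm_b, eps)
    return total / len(fa)


def linear_mmd(fa: Features, fb: Features) -> float:
    """Squared distance between mean feature vectors (MMD with a linear kernel)."""
    channels = len(fa[0])
    mean_a = [sum(v[c] for v in fa) / len(fa) for c in range(channels)]
    mean_b = [sum(v[c] for v in fb) / len(fb) for c in range(channels)]
    return sum((a - b) ** 2 for a, b in zip(mean_a, mean_b))


# Two pixels, two channels: same directions, different magnitudes.
feats_clear = [[1.0, 0.0], [0.0, 2.0]]
feats_rain = [[2.0, 0.0], [0.0, 1.0]]

l2_value = l2_alignment(feats_clear, feats_rain)
cs_value = cosine_alignment(feats_clear, feats_rain)
mmd_value = linear_mmd(feats_clear, feats_rain)
```

In this toy input the feature directions agree at every pixel and only the magnitudes differ, so the cosine term is zero while the L2 and MMD terms are not. The per-pixel variants depend on the pixel alignment between appearances, whereas MMD only compares feature distributions.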
Although the dataset has less object diversity than GTA V, with 18 CARLA vehicle models compared with more than 150 in GTA V, it matches or outperforms GTA V in the domain generalization experiments: 51.0 versus 48.9 on Cityscapes, 24.9 versus 20.4 on Dark Zurich, and a tie at 39.8 on ACDC. This suggests that precise label consistency across appearances can compensate for lower asset diversity in some training setups.
The table below reports per-class performance for Ours to Cityscapes under different alignment metrics: L2 distance, maximum mean discrepancy (MMD), and cosine similarity (CS). Bold values indicate the best result for each class. Baseline refers to standard training with all appearances and no alignment loss.
| Alignment | road | sidewalk | building | wall | fence | pole | traffic light | traffic sign | vegetation | sky | person | rider | car | truck | motorcycle | bicycle | mIoU | mAcc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Baseline | 91.5 | 55.9 | 89.1 | 43.4 | 30.1 | 50.1 | 48.4 | **59.8** | 88.3 | 92.7 | 38.8 | 18.3 | 88.0 | 25.7 | **49.9** | **61.9** | 58.3 | 71.2 |
| L2 | 93.6 | 62.7 | 88.8 | 45.3 | 24.5 | **50.8** | **51.3** | 56.1 | 88.1 | **93.1** | **69.9** | 44.3 | 84.3 | 19.6 | 41.7 | 58.1 | 60.8 | 72.4 |
| MMD | 90.1 | 53.5 | 88.2 | 41.9 | 30.5 | 49.0 | **51.3** | 55.9 | 87.8 | 92.4 | 69.8 | **44.6** | **88.4** | **28.0** | **49.9** | 57.6 | 61.2 | 71.6 |
| CS | **95.4** | **68.6** | **89.2** | **49.2** | **33.7** | 50.4 | 49.6 | 59.2 | **89.0** | 92.5 | 68.0 | 41.1 | 86.5 | 21.7 | 48.9 | 61.6 | **62.8** | **74.8** |
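The mIoU and mAcc columns are summary metrics that are conventionally derived from a class confusion matrix. The sketch below shows the standard definitions, assuming `confusion[i][j]` counts pixels of true class `i` predicted as class `j`. The function name `miou_macc` is hypothetical and this is not the evaluation code used for the table.

```python
from typing import List, Tuple


def miou_macc(confusion: List[List[int]]) -> Tuple[float, float]:
    """Return (mean IoU, mean per-class accuracy) from a confusion matrix.

    confusion[i][j] = number of pixels with true class i predicted as class j.
    Classes with no ground-truth and no predicted pixels are skipped for IoU;
    classes with no ground-truth pixels are skipped for accuracy.
    """
    num_classes = len(confusion)
    ious, accs = [], []
    for c in range(num_classes):
        true_pos = confusion[c][c]
        gt_pixels = sum(confusion[c])                                # labelled c
        pred_pixels = sum(confusion[r][c] for r in range(num_classes))  # predicted c
        union = gt_pixels + pred_pixels - true_pos
        if union > 0:
            ious.append(true_pos / union)
        if gt_pixels > 0:
            accs.append(true_pos / gt_pixels)
    return sum(ious) / len(ious), sum(accs) / len(accs)


# Toy two-class example.
confusion = [
    [8, 2],  # true class 0: 8 correct, 2 predicted as class 1
    [1, 9],  # true class 1: 1 predicted as class 0, 9 correct
]
miou, macc = miou_macc(confusion)
```

Here the per-class IoUs are 8/11 and 9/12 and the per-class accuracies are 0.8 and 0.9. Because IoU also penalizes false positives, mIoU is typically lower than mAcc, which matches the pattern in the table.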
Used as the source dataset, it also matches or improves on GTA and Synthia in domain generalization across three target datasets. All methods were trained with the same hyperparameters and number of iterations. Bold values indicate the best result for each target dataset.
| Dataset | Cityscapes | ACDC | Dark Zurich |
|---|---|---|---|
| GTA | 48.9 | **39.8** | 20.4 |
| Synthia | 40.5 | 31.8 | 17.8 |
| Ours | **51.0** | **39.8** | **24.9** |
The full paper is available on arXiv.
