Leveraging Contrastive Learning for Semantic Segmentation with Consistent Labels Across Varying Appearances

Feature alignment is common in domain adaptation for classification, but it is harder to apply to semantic segmentation. Segmentation requires dense labels, and alignment losses work best when different appearances share the same pixel-level ground truth. Real images rarely provide that structure, and generative methods often fail to preserve exact correspondence between image content and labels.

Simulation gives direct control over this constraint. Using CARLA, I generated a semantic segmentation dataset in which each scene is rendered under multiple visual appearances that all share the same ground-truth mask. This makes it possible to train feature consistency across appearances without pseudo-labeling.

In this work, the contrastive signal comes from comparing pixel-aligned appearances of the same scene. Because the label at every pixel is identical across appearances, features at the same location can be pushed to agree regardless of visual conditions.
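One way to write this idea down, as a sketch rather than the paper's exact formulation: assume a feature extractor f, two appearances x^(a) and x^(b) of the same scene with shared mask y, a standard segmentation loss, and a weighting factor lambda. All symbols, and the weight lambda in particular, are assumptions introduced here for illustration.

```latex
\mathcal{L}_{\text{total}}
  = \mathcal{L}_{\text{seg}}\big(x^{(a)}, y\big)
  + \mathcal{L}_{\text{seg}}\big(x^{(b)}, y\big)
  + \lambda \, \mathcal{L}_{\text{align}}\big(f(x^{(a)}),\, f(x^{(b)})\big)
```

The alignment term only makes sense because both appearances are pixel-aligned: the feature vectors being compared describe the same scene content.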

I evaluated this setup with DAFormer in unsupervised domain adaptation (UDA) and domain generalization pipelines. The experiments compare standard training with all appearances against feature consistency losses computed across aligned appearances. On Cityscapes, all three alignment losses improve mIoU over the baseline (58.3), and cosine-similarity alignment performs best (62.8 mIoU).
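To make the idea of a feature consistency loss concrete, here is a minimal NumPy sketch of three candidate alignment terms between the feature maps of two aligned appearances. It is not the paper's implementation: the function names, the (C, H, W) feature layout, the per-pixel formulation, and the RBF kernel with bandwidth `sigma` in the MMD variant are all assumptions chosen for illustration. In real training these would be written in an autodiff framework so gradients can flow.

```python
import numpy as np


def l2_alignment(feat_a, feat_b):
    """Mean squared difference between two (C, H, W) feature maps."""
    return float(np.mean((feat_a - feat_b) ** 2))


def cosine_alignment(feat_a, feat_b, eps=1e-8):
    """One minus the mean per-pixel cosine similarity along the channel axis."""
    dot = np.sum(feat_a * feat_b, axis=0)
    norm = np.linalg.norm(feat_a, axis=0) * np.linalg.norm(feat_b, axis=0)
    return float(1.0 - np.mean(dot / (norm + eps)))


def mmd_alignment(feat_a, feat_b, sigma=1.0):
    """Biased squared-MMD estimate with an RBF kernel (kernel choice is an assumption).

    Each pixel's C-dimensional feature vector is treated as one sample.
    """
    channels = feat_a.shape[0]
    xa = feat_a.reshape(channels, -1).T  # (H*W, C)
    xb = feat_b.reshape(channels, -1).T  # (H*W, C)

    def rbf(x, y):
        # Pairwise squared Euclidean distances, clipped at zero for stability.
        sq = np.sum(x ** 2, axis=1)[:, None] + np.sum(y ** 2, axis=1)[None, :] - 2.0 * x @ y.T
        return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))

    return float(rbf(xa, xa).mean() + rbf(xb, xb).mean() - 2.0 * rbf(xa, xb).mean())


# Hypothetical features for two appearances of the same scene.
rng = np.random.default_rng(0)
feat_a = rng.normal(size=(8, 4, 4))
feat_b = feat_a + 0.1 * rng.normal(size=(8, 4, 4))

print("L2 :", l2_alignment(feat_a, feat_b))
print("CS :", cosine_alignment(feat_a, feat_b))
print("MMD:", mmd_alignment(feat_a, feat_b))
```

One property worth noting in this sketch: the cosine term ignores feature magnitude, so two feature maps that differ only by a positive scale factor incur no penalty, whereas the L2 term penalizes them.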

Although the dataset has less object diversity than GTA V, with 18 CARLA vehicle models compared with more than 150 in GTA V, it matches or exceeds GTA V on every target dataset in the domain generalization experiments (higher on Cityscapes and Dark Zurich, tied on ACDC). This suggests that precise label consistency across appearances can compensate for lower asset diversity in some training setups.

The table reports per-class IoU, mIoU, and mAcc (all in %) for models trained on our dataset and evaluated on Cityscapes, using different alignment losses: L2 distance, MMD (maximum mean discrepancy), and CS (cosine similarity). Baseline refers to standard training with all appearances. An asterisk marks the best result in each column, including ties.

Alignment	road	sidewalk	building	wall	fence	pole	traffic light	traffic sign	vegetation	sky	person	rider	car	truck	motorcycle	bicycle	mIoU	mAcc
Baseline	91.5	55.9	89.1	43.4	30.1	50.1	48.4	59.8*	88.3	92.7	38.8	18.3	88.0	25.7	49.9*	61.9*	58.3	71.2
L2	93.6	62.7	88.8	45.3	24.5	50.8*	51.3*	56.1	88.1	93.1*	69.9*	44.3	84.3	19.6	41.7	58.1	60.8	72.4
MMD	90.1	53.5	88.2	41.9	30.5	49.0	51.3*	55.9	87.8	92.4	69.8	44.6*	88.4*	28.0*	49.9*	57.6	61.2	71.6
CS	95.4*	68.6*	89.2*	49.2*	33.7*	50.4	49.6	59.2	89.0*	92.5	68.0	41.1	86.5	21.7	48.9	61.6	62.8*	74.8*
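For readers unfamiliar with the mIoU and mAcc columns, the sketch below shows how these metrics are commonly computed from a confusion matrix. It assumes the usual conventions (rows are ground-truth classes, columns are predictions, label 255 is ignored, mAcc is the mean of per-class recall). The function names are hypothetical and this is not the evaluation code used for the table.

```python
import numpy as np


def confusion_matrix(pred, target, num_classes, ignore_index=255):
    """Count (ground truth, prediction) pairs, skipping ignored pixels."""
    mask = target != ignore_index
    idx = target[mask].astype(np.int64) * num_classes + pred[mask].astype(np.int64)
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)


def miou_macc(conf):
    """Return (mIoU, mAcc, per-class IoU) from a confusion matrix."""
    tp = np.diag(conf).astype(np.float64)
    gt_pixels = conf.sum(axis=1)      # pixels labeled as each class
    pred_pixels = conf.sum(axis=0)    # pixels predicted as each class
    union = gt_pixels + pred_pixels - tp
    with np.errstate(divide="ignore", invalid="ignore"):
        iou = tp / union              # NaN for classes absent from both maps
        acc = tp / gt_pixels          # NaN for classes absent from ground truth
    return float(np.nanmean(iou)), float(np.nanmean(acc)), iou


# Toy 2x2 example with two classes.
target = np.array([[0, 0], [1, 1]])
pred = np.array([[0, 1], [1, 1]])
conf = confusion_matrix(pred, target, num_classes=2)
miou, macc, per_class_iou = miou_macc(conf)
print(conf)
print(miou, macc, per_class_iou)
```

In practice the confusion matrix is accumulated over the whole validation set before the ratios are taken, rather than averaged per image.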

Used as a source dataset, ours also matches or outperforms GTA and Synthia in domain generalization to all three target datasets. All methods were trained with the same hyperparameters and number of iterations. An asterisk marks the best result for each target dataset, including ties.

Source dataset	Cityscapes	ACDC	Dark Zurich
GTA	48.9	39.8*	20.4
Synthia	40.5	31.8	17.8
Ours	51.0*	39.8*	24.9*

The full paper is available on arXiv.