Unsupervised Class Generation to Expand Semantic Segmentation Datasets

While working with CARLA-based simulation data, I found a mismatch between the classes present in synthetic datasets and the target labels used by Cityscapes. Some missing labels can be recovered by relabeling existing assets, but others, such as trains, are difficult to add directly because the simulator maps do not contain the required scene structure. I tested whether synthetic object instances could be pasted into source-domain images and used inside an unsupervised domain adaptation (UDA) pipeline to teach the segmentation model those missing classes.

To generate the instances, I combined Stable Diffusion with the Segment Anything Model (SAM). DAAM (Diffusion Attentive Attribution Maps) supplies a location prior for the generated object: the diffusion cross-attention maps associated with the class name in the prompt. I used that prior to prompt SAM and extract a mask proposal for each generated class instance.
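
As a rough illustration of how an attention heatmap can become a segmentation prompt, the sketch below derives a point and a box from a 2D heatmap. The function name `prompts_from_heatmap` and the `rel_threshold` parameter are assumptions made for this example, not part of the original pipeline. The outputs use (x, y) and (x0, y0, x1, y1) pixel order, which is the layout point and box prompts take in the `segment_anything` predictor.

```python
import numpy as np

def prompts_from_heatmap(heatmap, rel_threshold=0.5):
    """Derive a point prompt and a box prompt from a 2D attention heatmap.

    Hypothetical helper: rel_threshold is an illustrative choice,
    not a value taken from the original work.
    """
    heatmap = np.asarray(heatmap, dtype=np.float32)
    span = float(heatmap.max() - heatmap.min())
    if span <= 0:
        raise ValueError("flat heatmap carries no location prior")
    # Rescale to [0, 1] so the threshold is relative to the peak.
    norm = (heatmap - heatmap.min()) / span
    # Point prompt: pixel with the strongest attention response.
    y, x = np.unravel_index(np.argmax(norm), norm.shape)
    # Box prompt: tight box around all pixels above the threshold.
    ys, xs = np.nonzero(norm >= rel_threshold)
    box = np.array([xs.min(), ys.min(), xs.max(), ys.max()])
    return np.array([x, y]), box

# Toy heatmap: a warm rectangle with one hot pixel.
toy = np.zeros((8, 8), dtype=np.float32)
toy[2:5, 3:7] = 1.0
toy[3, 4] = 2.0
point, box = prompts_from_heatmap(toy)
print(point, box)
```

Whether a point, a box, or both works best as a prompt depends on the object class and on how diffuse the attention map is.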

After generation, I curated the masks with three quality metrics to remove noisy samples and build an instance-level dataset.
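
The three metrics are not named in this summary, so the sketch below only shows the general shape of such a filter. The three scores used here (mask area fraction, bounding-box fill ratio, share of attention inside the mask) and all thresholds are hypothetical stand-ins, not the metrics from the paper.

```python
import numpy as np

def area_fraction(mask):
    """Fraction of the image covered by the mask."""
    return float(mask.mean())

def box_fill_ratio(mask):
    """Mask pixels divided by the area of the mask's bounding box."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return 0.0
    box_area = (ys.max() - ys.min() + 1) * (xs.max() - xs.min() + 1)
    return float(mask.sum() / box_area)

def attention_inside_mask(mask, heatmap):
    """Share of total heatmap mass that falls inside the mask."""
    total = float(heatmap.sum())
    return float(heatmap[mask].sum() / total) if total > 0 else 0.0

def keep_sample(mask, heatmap, min_area=0.05, min_fill=0.4, min_attention=0.6):
    """Keep a sample only if all three (placeholder) scores pass."""
    mask = np.asarray(mask, dtype=bool)
    heatmap = np.asarray(heatmap, dtype=np.float32)
    return (
        area_fraction(mask) >= min_area
        and box_fill_ratio(mask) >= min_fill
        and attention_inside_mask(mask, heatmap) >= min_attention
    )

# A compact mask that agrees with the heatmap passes the filter.
good = np.zeros((10, 10), dtype=bool)
good[2:8, 2:8] = True
print(keep_sample(good, good.astype(np.float32)))
```

Requiring every score to pass is a conservative choice: it discards more samples, but a single bad score is enough to reject a noisy mask.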

During training, I pasted these masked instances onto source images with a CutMix-style procedure and trained DAFormer inside a UDA pipeline. The results show that the model can learn missing classes from overlaid synthetic instances alone. The table below reports per-class IoU and mIoU after applying the method to Synthia → Cityscapes and 4AGT → Cityscapes. Each row gives the source dataset followed by the synthetic classes added to it; a dash in a class column marks a class with no result for that configuration.
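
A minimal sketch of the overlay step, assuming the instance is stored as an RGB crop with a binary mask and that it fits inside the source image. The function name `paste_instance` and its arguments are illustrative, and the real procedure may also handle scaling, placement, and blending.

```python
import numpy as np

def paste_instance(image, label, inst_rgb, inst_mask, class_id, top, left):
    """Overlay a masked instance onto a source image and its label map.

    image: (H, W, 3), label: (H, W), inst_rgb: (h, w, 3), inst_mask: (h, w).
    Returns modified copies; the inputs are left untouched.
    """
    h, w = inst_mask.shape
    if top < 0 or left < 0 or top + h > image.shape[0] or left + w > image.shape[1]:
        raise ValueError("instance does not fit at the requested position")
    out_img, out_lbl = image.copy(), label.copy()
    m = inst_mask.astype(bool)
    # Slices are views, so writing through them updates the copies.
    out_img[top:top + h, left:left + w][m] = inst_rgb[m]
    out_lbl[top:top + h, left:left + w][m] = class_id
    return out_img, out_lbl

# Toy example: paste a 2x3 instance labeled 16 (the Cityscapes train ID for "train").
image = np.zeros((6, 6, 3), dtype=np.uint8)
label = np.zeros((6, 6), dtype=np.int64)
inst_rgb = np.full((2, 3, 3), 255, dtype=np.uint8)
inst_mask = np.array([[1, 1, 0], [0, 1, 1]])
new_image, new_label = paste_instance(image, label, inst_rgb, inst_mask, 16, top=1, left=2)
print(new_label)
```

Only pixels inside the mask are overwritten, so the surrounding source labels stay valid and the pasted pixels carry the new class.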

Source + added class	Road	Sidewalk	Building	Wall	Fence	Pole	Traffic Light	Traffic Sign	Vegetation	Sky	Person	Rider	Car	Truck	Bus	Train	Motorcycle	Bicycle	mIoU
Synthia (baseline)	82.4	37.7	88.7	43.0	8.4	50.8	55.7	55.1	86.0	88.1	74.2	49.5	87.8	-	63.2	-	54.5	62.8	54.9
Synthia + Train	87.5	50.2	88.4	44.4	1.6	49.2	53.1	50.7	85.3	92.8	74.5	48.2	85.9	-	70.6	29.7	53.4	60.1	57.0
Synthia + Truck	82.5	40.8	88.8	44.6	6.8	50.4	55.5	51.0	85.1	91.7	67.1	47.6	90.5	64.0	60.3	-	55.8	62.2	58.0
Synthia + Both	86.5	47.1	88.3	44.4	4.4	49.9	54.1	54.8	86.6	93.0	73.1	41.2	86.3	38.4	49.2	52.0	53.5	60.8	59.1

4AGT (baseline)	91.5	68.5	89.1	43.4	30.1	50.1	48.4	59.8	88.3	92.7	70.7	35.4	88.0	25.7	-	-	49.9	61.9	55.2
4AGT + Train	96.1	70.1	88.9	44.4	29.9	50.2	54.3	62.3	88.1	92.7	69.9	41.9	86.9	66.1	-	42.7	54.8	58.4	61.0
4AGT + Bus	96.0	70.7	89.2	44.6	33.8	51.8	54.4	60.6	88.6	93.7	69.6	38.6	90.1	56.7	49.5	-	52.0	60.3	61.1
4AGT + Both	95.9	70.5	87.5	33.7	25.9	51.0	53.0	57.6	88.3	93.2	70.4	43.1	85.5	35.4	65.2	65.5	55.1	61.0	63.2
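
To make the mIoU column easier to read: the reported values appear consistent with averaging over the 18 listed classes and counting a dash as zero. The helper below, with the hypothetical name `mean_iou`, reproduces the Synthia baseline figure from the per-class cells under that assumption.

```python
def mean_iou(cells, missing_as_zero=True):
    """Average per-class IoU cells; '-' marks a class without a score."""
    values = []
    for cell in cells:
        if cell.strip() == "-":
            # Assumption: a missing class contributes 0 to the mean.
            if missing_as_zero:
                values.append(0.0)
        else:
            values.append(float(cell))
    return sum(values) / len(values)

# Per-class cells of the Synthia baseline row, in table order.
synthia_baseline = (
    "82.4 37.7 88.7 43.0 8.4 50.8 55.7 55.1 86.0 "
    "88.1 74.2 49.5 87.8 - 63.2 - 54.5 62.8"
).split()

print(round(mean_iou(synthia_baseline), 1))                         # 54.9
print(round(mean_iou(synthia_baseline, missing_as_zero=False), 1))  # 61.7
```

Under this reading, part of the mIoU gain in the augmented rows comes simply from a previously missing class now contributing a nonzero score, so the per-class columns are the better place to judge side effects on existing classes.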

The confusion matrices show the same effect from another angle. Before adding the synthetic masks, the model often assigned pixels of the missing classes to visually similar labels. After adding the generated instances, this confusion is reduced. Figure panels: a) Synthia baseline, b) Synthia plus synthetic classes, c) 4AGT baseline, d) 4AGT plus synthetic classes.
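
For readers who want to reproduce this kind of analysis, here is a minimal sketch of a row-normalized confusion matrix computed from ground-truth and predicted label maps. The function names and the `ignore_index=255` default (the usual Cityscapes convention) are assumptions for this example, not details from the paper.

```python
import numpy as np

def confusion_matrix(gt, pred, num_classes, ignore_index=255):
    """Count pixels per (ground-truth class, predicted class) pair."""
    gt = np.asarray(gt).ravel()
    pred = np.asarray(pred).ravel()
    valid = gt != ignore_index  # drop unlabeled pixels
    idx = gt[valid].astype(np.int64) * num_classes + pred[valid].astype(np.int64)
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)

def row_normalize(cm):
    """Turn counts into per-ground-truth-class prediction rates."""
    totals = cm.sum(axis=1, keepdims=True)
    out = np.zeros(cm.shape, dtype=float)
    return np.divide(cm, totals, out=out, where=totals > 0)

# Toy example with 3 classes: class 2 is always predicted as class 1.
gt = np.array([0, 0, 1, 1, 2, 2, 255])
pred = np.array([0, 0, 1, 2, 1, 1, 0])
rates = row_normalize(confusion_matrix(gt, pred, num_classes=3))
print(rates)
```

In a row-normalized matrix, a missing class shows up as a row whose mass sits off the diagonal, in the columns of the labels it is mistaken for.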

The full paper is available on MDPI.