Unsupervised Class Generation to Expand Semantic Segmentation Datasets
While working with CARLA-based simulation data, I found a mismatch between the classes present in synthetic datasets and the target labels used by Cityscapes. Some missing labels can be recovered by relabeling existing assets, but others, such as trains, are difficult to add directly because the simulator maps do not contain the required scene structure. I tested whether synthetic object instances could be pasted into source-domain images and used in an unsupervised domain adaptation (UDA) pipeline to teach the segmentation model those missing classes.
To build the pipeline, I combined Stable Diffusion with the Segment Anything Model (SAM). DAAM (Diffusion Attentive Attribution Maps) provides a location prior for the generated object: it aggregates the diffusion model's cross-attention maps for the class-name token into a heat map over the image. I used that prior to prompt SAM and extract a mask proposal for each generated class instance.
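As a rough illustration of how a heat map could become a segmentation prompt, the sketch below reduces a 2D attention map to a peak point and a bounding box. The function name `heatmap_to_prompt` and the relative threshold are assumptions made for this example; the exact prompting strategy is not spelled out here, so this is one plausible reading rather than the method itself.

```python
from typing import List, Tuple

Heatmap = List[List[float]]  # rows of attention scores, heat[y][x]


def heatmap_to_prompt(
    heat: Heatmap, rel_threshold: float = 0.5
) -> Tuple[Tuple[int, int], Tuple[int, int, int, int]]:
    """Return (peak_xy, box_xyxy) for a 2D heat map.

    Hypothetical helper: the peak is the first cell holding the maximum
    score, and the box encloses every cell whose score is at least
    rel_threshold times that maximum.
    """
    peak_val = max(max(row) for row in heat)
    if peak_val <= 0:
        raise ValueError("heat map has no positive activation")

    cutoff = rel_threshold * peak_val
    peak_xy = None
    xs, ys = [], []
    for y, row in enumerate(heat):
        for x, value in enumerate(row):
            if peak_xy is None and value == peak_val:
                peak_xy = (x, y)
            if value >= cutoff:
                xs.append(x)
                ys.append(y)

    box_xyxy = (min(xs), min(ys), max(xs), max(ys))
    return peak_xy, box_xyxy


# Toy heat map (made-up values, not real DAAM output).
toy_heat = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.2, 0.6, 0.0],
    [0.0, 0.5, 1.0, 0.1],
    [0.0, 0.0, 0.4, 0.0],
]
point, box = heatmap_to_prompt(toy_heat)
```

SAM's predictor accepts both point and box prompts, so either output could be passed on after scaling the coordinates from heat-map resolution to image resolution.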

After generation, I curated the masks with three quality metrics to remove noisy samples and build an instance-level dataset.
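During training, I pasted these masks into source images with a CutMix-style procedure and trained DAFormer inside a UDA pipeline. The results show that the model can learn missing classes from overlaid synthetic instances alone. The table below reports per-class IoU and mIoU for semantic segmentation after applying the method to Synthia to Cityscapes and 4AGT to Cityscapes. Each row names the source dataset and the synthetic class or classes added, with "-" marking the baseline; a "-" in a class column means that class was absent from training.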
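The CutMix-style overlay described above can be sketched in a few lines. This is a minimal version under stated assumptions: images and label maps are plain nested lists standing in for arrays, `paste_instance` is a hypothetical helper name, and the class ID is illustrative. The actual training code may differ in placement, scaling, and blending.

```python
def paste_instance(image, labels, inst_pixels, inst_mask, class_id, top, left):
    """Overlay a masked instance onto a source image and its label map.

    Only pixels where inst_mask is truthy are copied, so the instance keeps
    its shape instead of arriving as a rectangular patch. Parts that fall
    outside the image are clipped. Returns new copies; inputs are untouched.
    """
    out_image = [row[:] for row in image]
    out_labels = [row[:] for row in labels]
    height, width = len(image), len(image[0])

    for dy, mask_row in enumerate(inst_mask):
        for dx, inside in enumerate(mask_row):
            y, x = top + dy, left + dx
            if inside and 0 <= y < height and 0 <= x < width:
                out_image[y][x] = inst_pixels[dy][dx]
                out_labels[y][x] = class_id
    return out_image, out_labels


# Toy example: 3x3 grayscale image, 2x2 instance with an L-shaped mask.
TRAIN_ID = 16  # illustrative label ID for the pasted class
src_image = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
src_labels = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
instance = [[9, 9], [9, 9]]
mask = [[1, 0], [1, 1]]

new_image, new_labels = paste_instance(
    src_image, src_labels, instance, mask, TRAIN_ID, top=1, left=1
)
```

Writing the class ID into the label map at the same pixels is what lets the segmentation loss supervise a class that never appears in the original source annotations.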
| Method | Road | Sidewalk | Building | Wall | Fence | Pole | Traffic Light | Traffic Sign | Vegetation | Sky | Person | Rider | Car | Truck | Bus | Train | Motorcycle | Bicycle | mIoU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Synthia - | 82.4 | 37.7 | 88.7 | 43.0 | 8.4 | 50.8 | 55.7 | 55.1 | 86.0 | 88.1 | 74.2 | 49.5 | 87.8 | - | 63.2 | - | 54.5 | 62.8 | 54.9 |
| Synthia Train | 87.5 | 50.2 | 88.4 | 44.4 | 1.6 | 49.2 | 53.1 | 50.7 | 85.3 | 92.8 | 74.5 | 48.2 | 85.9 | - | 70.6 | 29.7 | 53.4 | 60.1 | 57.0 |
| Synthia Truck | 82.5 | 40.8 | 88.8 | 44.6 | 6.8 | 50.4 | 55.5 | 51.0 | 85.1 | 91.7 | 67.1 | 47.6 | 90.5 | 64.0 | 60.3 | - | 55.8 | 62.2 | 58.0 |
| Synthia Both | 86.5 | 47.1 | 88.3 | 44.4 | 4.4 | 49.9 | 54.1 | 54.8 | 86.6 | 93.0 | 73.1 | 41.2 | 86.3 | 38.4 | 49.2 | 52.0 | 53.5 | 60.8 | 59.1 |
| 4AGT - | 91.5 | 68.5 | 89.1 | 43.4 | 30.1 | 50.1 | 48.4 | 59.8 | 88.3 | 92.7 | 70.7 | 35.4 | 88.0 | 25.7 | - | - | 49.9 | 61.9 | 55.2 |
| 4AGT Train | 96.1 | 70.1 | 88.9 | 44.4 | 29.9 | 50.2 | 54.3 | 62.3 | 88.1 | 92.7 | 69.9 | 41.9 | 86.9 | 66.1 | - | 42.7 | 54.8 | 58.4 | 61.0 |
| 4AGT Bus | 96.0 | 70.7 | 89.2 | 44.6 | 33.8 | 51.8 | 54.4 | 60.6 | 88.6 | 93.7 | 69.6 | 38.6 | 90.1 | 56.7 | 49.5 | - | 52.0 | 60.3 | 61.1 |
| 4AGT Both | 95.9 | 70.5 | 87.5 | 33.7 | 25.9 | 51.0 | 53.0 | 57.6 | 88.3 | 93.2 | 70.4 | 43.1 | 85.5 | 35.4 | 65.2 | 65.5 | 55.1 | 61.0 | 63.2 |
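
A note on reading the mIoU column: the reported values are consistent with averaging over all 18 listed classes, with a "-" entry counted as 0. That is an inference from the numbers in the table, not a stated protocol, and the helper `mean_iou` below is a hypothetical name used only to check the arithmetic.

```python
def mean_iou(per_class_iou, absent_as_zero=True):
    """Average per-class IoU values; None marks a class with no score.

    With absent_as_zero=True, a missing class contributes 0 to the mean
    (the convention the table appears to follow). Otherwise missing
    classes are dropped from the average.
    """
    if absent_as_zero:
        values = [0.0 if v is None else v for v in per_class_iou]
    else:
        values = [v for v in per_class_iou if v is not None]
    return sum(values) / len(values)


# Rows copied from the table; None replaces "-" (Truck and Train).
synthia_baseline = [82.4, 37.7, 88.7, 43.0, 8.4, 50.8, 55.7, 55.1, 86.0,
                    88.1, 74.2, 49.5, 87.8, None, 63.2, None, 54.5, 62.8]
synthia_both = [86.5, 47.1, 88.3, 44.4, 4.4, 49.9, 54.1, 54.8, 86.6,
                93.0, 73.1, 41.2, 86.3, 38.4, 49.2, 52.0, 53.5, 60.8]

print(round(mean_iou(synthia_baseline), 1))  # 54.9, matches the table
print(round(mean_iou(synthia_both), 1))      # 59.1, matches the table
print(round(mean_iou(synthia_baseline, absent_as_zero=False), 1))  # 61.7
```

Under this convention, part of the mIoU gain comes simply from a previously missing class moving from 0 to a nonzero score, so the per-class columns are worth reading alongside the mean.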
The confusion matrices show the same effect from another angle. Before adding the synthetic masks, the model often assigned pixels of the missing classes to visually similar labels. After adding the generated instances, this overprediction of the similar labels is reduced. In the figure: a) Synthia baseline, b) Synthia plus synthetic classes, c) 4AGT baseline, d) 4AGT plus synthetic classes.
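
For readers who want to reproduce this kind of analysis, the sketch below builds a row-normalized confusion matrix from flat lists of ground-truth and predicted label IDs. The function names and the two-class toy data are made up for illustration and are not results from the paper.

```python
def confusion_matrix(gt, pred, num_classes):
    """Count pixels per (ground-truth, predicted) pair: cm[gt][pred]."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for g, p in zip(gt, pred):
        cm[g][p] += 1
    return cm


def row_normalize(cm):
    """Turn each row into fractions of that ground-truth class's pixels."""
    normalized = []
    for row in cm:
        total = sum(row)
        normalized.append([c / total if total else 0.0 for c in row])
    return normalized


# Toy labels: 0 = bus, 1 = train (invented pixels, not paper data).
gt_pixels = [1, 1, 1, 1, 0, 0]
pred_pixels = [0, 0, 0, 1, 0, 0]

cm = confusion_matrix(gt_pixels, pred_pixels, num_classes=2)
cm_norm = row_normalize(cm)
# Row 1 shows where ground-truth "train" pixels end up.
print(cm_norm[1])  # [0.75, 0.25]
```

A large off-diagonal entry in a missing class's row, like the 0.75 here, is the pattern described above: the class's pixels being absorbed by a visually similar label.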

The full paper is available on MDPI.
