SCoFT: Self-Contrastive Fine-Tuning for Equitable Image Generation

1CMU   2Nanyang Technological University   3Dongguk University
Accepted to CVPR 2024

Stable Diffusion perpetuates harmful stereotypes, such as assuming dirty buildings are representative of some nations, and often generates regionally irrelevant designs. Our approach reduces stereotypes, improves the cultural relevance of generated images, and achieves around an 80% preference rate in our human evaluation across 5 cultures.


Accurate representation in media is known to improve the well-being of the people who consume it. Generative image models trained on large web-crawled datasets such as LAION are known to produce images with harmful stereotypes and misrepresentations of cultures. We improve inclusive representation in generated images by (1) engaging with communities to collect a culturally representative dataset that we call the Cross-Cultural Understanding Benchmark (CCUB), and (2) proposing a novel Self-Contrastive Fine-Tuning (SCoFT) method that leverages the model's known biases to self-improve. SCoFT is designed to encode high-level information from the dataset into the model in order to shift it away from misrepresentations of a culture. Our user study with 51 participants from 5 different countries, grouped by their self-selected national cultural affiliation, shows that our proposed approach consistently generates images with higher cultural relevance and fewer stereotypes than the Stable Diffusion baseline.


Unlike concept-editing tasks with specific editing directions, depicting a culture accurately is a more abstract and challenging goal. SCoFT leverages the pre-trained model's cultural misrepresentations to refine itself: we harness the intrinsic biases of large pre-trained models as a rich source of counterexamples, and shifting away from these biases gives the model clues toward more accurate cultural concepts. Image samples from the pre-trained model serve as negative examples and CCUB images as positive examples, training the model to discern subtle differences between them. We de-noise latent codes over several iterations, project them into pixel space, and then compute the contrastive loss. To prevent overfitting when fine-tuning on a small dataset, we further introduce a memorization loss.
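The core idea above can be sketched as a triplet-style objective: pull the generated image's features toward the CCUB positive example and push them away from the pre-trained model's own biased sample. This is a minimal, illustrative sketch only — the function names, the plain Euclidean distance, and the margin formulation are our simplifying assumptions; the actual method operates on perceptual features of images decoded from partially de-noised latents, and combines this term with the memorization loss.

```python
import math

def l2(a, b):
    """Euclidean distance between two feature vectors (stand-in for a
    perceptual distance computed in pixel space after decoding latents)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def self_contrastive_loss(gen_feat, pos_feat, neg_feat, margin=1.0):
    """Triplet-style contrastive term (illustrative, not the paper's exact form):
    small when the generation is close to the CCUB positive and far from the
    pre-trained model's negative sample; zero once the margin is satisfied."""
    return max(0.0, l2(gen_feat, pos_feat) - l2(gen_feat, neg_feat) + margin)

# Toy 2-D feature vectors standing in for image features.
gen = [0.2, 0.1]   # current fine-tuned generation
pos = [0.0, 0.0]   # CCUB (culturally representative) image
neg = [1.0, 1.0]   # biased sample from the frozen pre-trained model

loss = self_contrastive_loss(gen, pos, neg)
```

Gradient descent on such a loss moves the generation toward the positive and away from the negative; once the generation matches the positive exactly and the negative is farther than the margin, the loss bottoms out at zero.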


Culturally-aware SCoFT Results


To investigate the effect of each loss function within SCoFT, we qualitatively compare each ablation in the left figure. Human evaluation results are shown in the violin plot of participant rankings across the survey items and countries; a wider strip means more answers with that value. Each new loss in our ablation study improved the rankings, and our full pipeline performs best (rank 1 is the best; rank 4, the worst).



Potential Applications

SCoFT's effectiveness reaches beyond national and demographic cultural applications. We demonstrate its adaptability by applying it to another fine-tuning domain: our internal prosthetics dataset. Here, SCoFT proves particularly effective at generating images that more accurately represent the culture of people with prosthetics.

[Figure: SCoFT results on the mobility/prosthetics dataset]


To tackle bias in the data, we aim for two goals: (1) generating accurate images given a specific cultural context, and (2) generating diverse images given a generic text prompt without any specific cultural context. Our current approach focuses on the first goal. Nevertheless, our model already generates promisingly diverse images for some generic prompts, such as "photo of a person", as shown in the figure below, compared to the baseline model, which generates biased images. Our CCUB dataset was collected by experienced residents; however, improving its quality will require more rigorous verification. We strongly encourage and invite everyone to participate in enriching the CCUB dataset.

[Figure: Diversity in generations for generic prompts]

Citation & BibTeX

Zhixuan Liu, Peter Schaldenbrand, Beverley-Claire Okogwu, Wenxuan Peng, Youngsik Yun, Andrew Hundt, Jihie Kim, and Jean Oh, "SCoFT: Self-Contrastive Fine-Tuning for Equitable Image Generation." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.

@inproceedings{liu2024scoft,
        title={SCoFT: Self-Contrastive Fine-Tuning for Equitable Image Generation}, 
        author={Zhixuan Liu and Peter Schaldenbrand and Beverley-Claire Okogwu and Wenxuan Peng and Youngsik Yun and Andrew Hundt and Jihie Kim and Jean Oh},
        booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
        year={2024}
}