SCoFT: Self-Contrastive Fine-Tuning for Equitable Image Generation

Zhixuan Liu¹, Peter Schaldenbrand¹, Beverley-Claire Okogwu¹, Wenxuan Peng²
Youngsik Yun³, Andrew Hundt¹, Jihie Kim³, Jean Oh¹

¹CMU ²Nanyang Technological University ³Dongguk University

Accepted to CVPR 2024


Stable Diffusion perpetuates harmful stereotypes, such as the assumption that dirty buildings are representative of some nations, and often generates regionally irrelevant designs. Our approach reduces stereotypes, improves the cultural relevance of generated images, and is preferred around 80% of the time in our human evaluation across 5 cultures.

Abstract

Accurate representation in media is known to improve the well-being of the people who consume it. Generative image models trained on large web-crawled datasets such as LAION are known to produce images with harmful stereotypes and misrepresentations of cultures. We improve inclusive representation in generated images by (1) engaging with communities to collect a culturally representative dataset that we call the Cross-Cultural Understanding Benchmark (CCUB), and (2) proposing a novel Self-Contrastive Fine-Tuning (SCoFT) method that leverages the model's known biases to self-improve. SCoFT is designed to encode high-level information from the dataset into the model for the purpose of shifting away from misrepresentations of a culture. Our user study, conducted with 51 participants from 5 different countries and grouped by their self-selected national cultural affiliation, shows that our proposed approach consistently generates images with higher cultural relevance and fewer stereotypes than the Stable Diffusion baseline.

Algorithm

Unlike concept editing tasks, which have specific image editing directions, depicting a culture accurately is a more abstract and challenging goal. SCoFT leverages the pre-trained model's cultural misrepresentations to refine the model itself. We harness the intrinsic biases of large pre-trained models as a rich source of counterexamples; shifting away from these biases gives the model clues towards more accurate cultural concepts. Image samples from the pre-trained model are used as negative examples, and CCUB images are used as positive examples, to train the model to discern subtle differences. We de-noise latent codes over several iterations, project them into the pixel space, and then compute the contrastive loss. To prevent overfitting when fine-tuning on a small dataset, we further introduce a memorization loss.
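The positive/negative setup described above can be illustrated with a generic triplet-style contrastive loss. This is only a minimal sketch under stated assumptions, not the loss used in SCoFT: this page does not give the exact formulation, so the function names, the mean squared distance, and the `margin` and `neg_weight` parameters are all hypothetical. The plain feature vectors stand in for features of the image decoded into pixel space, of a CCUB image (positive), and of a sample from the pre-trained model (negative).

```python
def mean_squared_distance(a, b):
    """Mean squared difference between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def self_contrastive_loss(generated, positive, negative, margin=1.0, neg_weight=1.0):
    """Hypothetical triplet-style loss (an assumption, not the paper's exact loss).

    generated: features of the image produced by the model being fine-tuned
    positive:  features of a culturally representative (CCUB) image
    negative:  features of an image sampled from the frozen pre-trained model
    """
    d_pos = mean_squared_distance(generated, positive)  # should become small
    d_neg = mean_squared_distance(generated, negative)  # should become large
    # The loss is zero once the generated image is closer to the positive
    # than to the (weighted) negative by at least `margin`.
    return max(d_pos - neg_weight * d_neg + margin, 0.0)


# Generated features match the positive and are far from the negative.
loss_good = self_contrastive_loss([0.0, 0.0], [0.0, 0.0], [2.0, 2.0])
# Generated features match the negative and are far from the positive.
loss_bad = self_contrastive_loss([0.0, 0.0], [2.0, 2.0], [0.0, 0.0])
```

In this toy setting the loss is lower when the generated features sit near the positive example than when they sit near the negative one, which is the behavior the self-contrastive setup relies on.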

Culturally-aware SCoFT Results

Nigerian culture: "Nigerian people in casual clothing nowadays", "dancers are performing for a crowd, in Nigeria",
"family is eating together, in Nigeria", "photo of a house, in Nigeria", "photo of a street, in Nigeria",
"photo of a bedroom, in Nigeria", "student studying in the classroom, in Nigeria".

Chinese culture: "people are performing traditional instrument, in China", "photo of a school, in China",
"photo of a street, in China", "family is eating together, in China", "two girls wearing Chinese traditional Han dress",
"a man and a woman, in China", "woman is painting in a traditional style, in China".

Indian culture: "photo of children in India", "photo of a house, in India",
"people wearing traditional clothing, in India", "family is eating together, in India", "photo of a street, in India",
"people walking on the street, in India", "people inside their house, in India".

Korean culture: "photo of a street, in Korea", "photo of a traditional building, in Korea",
"people wearing traditional clothing, in Korea", "a table of food in Korea", "a woman is painting in a traditional style, in Korea",
"musician performing Korean traditional instrument", "photo of a family, in Korea".

Mexican culture: "photo of a building, in Mexico", "people wearing traditional clothing, in Mexico",
"photo of a family, in Mexico", "photo of a school, in Mexico", "university student studying, in Mexico",
"people performing traditional music instrument, in Mexico", "family is eating together, in Mexico".

Ablation

To investigate the effect of each loss function within SCoFT, we qualitatively compare the ablations in the left figure. Human evaluation results are shown in the violin plot of participant rankings across the survey items and countries. A wider strip means more answers with that value. Each loss added in our ablation study improved the rankings, and our full pipeline ranks best. (Rank 1 is the best; 4, the worst.)

Potential Applications

SCoFT's effectiveness reaches beyond demographic and cultural applications. We demonstrate its adaptability by applying it to another fine-tuning domain: our internal prosthetics dataset. Here, SCoFT has been shown to be particularly effective in generating images that more accurately represent the culture of people with prosthetics.

Limitations

To tackle the bias in the data, we aim for two goals: (1) to generate accurate images given a specific cultural context and (2) to generate diverse images given a generic text prompt without any specific cultural context. Our current approach focuses on the first goal. For some generic prompts, such as "photo of a person", our current model can nonetheless generate promisingly diverse images, as shown in the figure below, whereas the baseline model generates biased images. Our CCUB dataset was collected by experienced residents; however, to improve the quality of the dataset, more rigorous verification will be needed. We strongly encourage and invite everyone to participate in enriching the CCUB dataset.

Citation & BibTeX

Zhixuan Liu, Peter Schaldenbrand, Beverley-Claire Okogwu, Wenxuan Peng, Youngsik Yun, Andrew Hundt, Jihie Kim, and Jean Oh, "SCoFT: Self-Contrastive Fine-Tuning for Equitable Image Generation." The IEEE / CVF Computer Vision and Pattern Recognition Conference (CVPR) 2024.

@misc{liu2024scoft,
  title={SCoFT: Self-Contrastive Fine-Tuning for Equitable Image Generation},
  author={Zhixuan Liu and Peter Schaldenbrand and Beverley-Claire Okogwu and Wenxuan Peng and Youngsik Yun and Andrew Hundt and Jihie Kim and Jean Oh},
  year={2024},
  eprint={2401.08053},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}