Novel DIG Metrics for Fairer Image Generation

In an article recently posted to the Meta Research website, researchers introduced decomposed indicators of disparities in image generation (Decomposed-DIG), a set of metrics for separately measuring geographic disparities in objects and backgrounds of generated images.

Study: Uncovering Geographic Bias in Image Generation. Image Credit: metamorworks/Shutterstock

By auditing a latent diffusion model, the study found that generated objects were more realistic than generated backgrounds and that backgrounds showed larger regional disparities, particularly for Africa. The paper also presented a new prompting structure that improved background diversity in generated images.

Background

Recent advancements in text-to-image generative systems have greatly improved visual content creation and downstream discriminative model training. However, these models often exhibit social biases, especially geographic disparities in image realism and representation diversity.

Prior work found that generated images frequently represent regions like Africa with stereotypical and inaccurate depictions. Existing evaluation metrics, such as those that compare generated images holistically to real images, cannot attribute these biases to specific image components like objects and backgrounds.

To address this gap, the paper introduced Decomposed-DIG, a set of metrics that separately measure disparities in object and background depictions in generated images. This decomposition enabled a finer-grained analysis of geographic bias, revealing that generated objects were typically more realistic than backgrounds and that backgrounds exhibited greater regional disparities. The study also proposed a prompting technique that significantly improved background diversity, yielding more representative generated images.

Detailed Benchmarking Protocol for Analyzing Geographic Disparities in Image Generation

The process involved three main steps.

  • Object and background segmentation: Images were segmented into object and background components using the Segment Anything Model (SAM), accessed through LangSAM. SAM took bounding boxes produced by the GroundingDINO object detector and turned them into precise segmentation masks. Any image region not identified as an object was treated as background, giving a clear division for subsequent analysis.
  • Decomposed image features: A vision transformer (ViT) was used for feature extraction, focusing on object-specific and background-specific patches within the segmented images. Because a ViT processes an image as a grid of patches, features could be isolated for objects and backgrounds separately, allowing realism and diversity to be measured for each. This contrasted with traditional convolutional neural network (CNN)-based approaches; here, patch-level attention scores were used to refine feature extraction.
  • Object and background-specific measurements: Using the ViT features, the protocol calculated precision and coverage separately for object-only ("Obj-only") and background-only ("BG-only") settings across different geographic regions. This helped pinpoint disparities more accurately than previous holistic evaluations, which considered entire images without separating objects from backgrounds.
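
The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes per-patch ViT features and a per-patch object mask are already available (segmentation and feature extraction are not shown), it uses simple mean pooling over patches as a stand-in aggregation, and it assumes the common k-nearest-neighbour definitions of precision and coverage. The function names `pool_patches`, `knn_radius`, and `precision_and_coverage` are hypothetical.

```python
import math


def pool_patches(patch_features, patch_mask, keep):
    """Average the patch features whose mask value equals `keep`.

    patch_features: list of per-patch feature vectors (lists of floats).
    patch_mask: list of booleans, True where a patch belongs to the object.
    keep: True for an Obj-only feature, False for a BG-only feature.
    Mean pooling is an assumption made for this sketch.
    """
    selected = [f for f, m in zip(patch_features, patch_mask) if m == keep]
    if not selected:
        return None  # the image has no patches of the requested kind
    dim = len(selected[0])
    return [sum(f[i] for f in selected) / len(selected) for i in range(dim)]


def knn_radius(points, index, k):
    """Distance from points[index] to its k-th nearest other point.

    Requires len(points) > k.
    """
    dists = sorted(
        math.dist(points[index], p) for j, p in enumerate(points) if j != index
    )
    return dists[k - 1]


def precision_and_coverage(real, generated, k=1):
    """Return (precision, coverage) for two lists of feature vectors."""
    radii = [knn_radius(real, i, k) for i in range(len(real))]
    # Precision: share of generated samples that fall inside the
    # k-NN neighbourhood of at least one real sample (realism).
    inside = sum(
        any(math.dist(g, r) <= rad for r, rad in zip(real, radii))
        for g in generated
    )
    # Coverage: share of real samples whose k-NN neighbourhood
    # contains at least one generated sample (diversity).
    covered = sum(
        any(math.dist(g, r) <= rad for g in generated)
        for r, rad in zip(real, radii)
    )
    return inside / len(generated), covered / len(real)
```

Under these assumptions, an Obj-only or BG-only score for a region would come from pooling each real and generated image with `keep=True` or `keep=False`, then passing the two resulting feature lists to `precision_and_coverage`.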

Decomposed-DIG made the evaluation more granular by scoring specific components of generated images. Disparities in realism and representation diversity could therefore be attributed to distinct parts of the image, which in turn points to where a generative model needs targeted improvement.

Analysis of Geographic Disparities in Generated Images

The authors applied Decomposed-DIG to analyze geographic biases in the widely used latent diffusion model (LDM) 1.5.3. They focused on dissecting the disparities between object and background components in generated images across different geographic regions.

First, objects generally exhibited higher realism than backgrounds, as indicated by higher precision scores in Obj-only evaluations than in BG-only evaluations. This suggested that while generated objects aligned more closely with their real counterparts, backgrounds often depicted settings less representative of real-world diversity, such as rural scenes in Africa or historical architecture in Europe.

Second, backgrounds displayed significantly larger geographic disparities than objects. Coverage in the BG-only setting varied notably across regions, indicating more severe representation diversity problems for backgrounds than for objects. The researchers supported these findings with qualitative observations of generation patterns, identifying specific failure modes where the LDM struggled to depict diverse backgrounds or realistic objects in certain regions. For instance, backgrounds generated for Africa tended to lack diversity in neutral scenes, while objects like modern vehicles were inadequately represented.
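
One simple way to summarize such regional disparities is to compare the worst-performing region with the best-performing one and with the average. The sketch below is an assumption about how per-region scores might be summarized, not the paper's exact disparity definition; `summarize_disparity` is a hypothetical helper, and it expects a mapping from region name to a metric value such as BG-only coverage.

```python
def summarize_disparity(scores_by_region):
    """Summarize per-region metric values (e.g. BG-only coverage).

    scores_by_region: dict mapping a region name to a score in [0, 1].
    Returns the worst region, its score, the mean score, and the gap
    between the best and worst regions. Defining disparity as
    best-minus-worst is an assumption made for this sketch.
    """
    worst_region = min(scores_by_region, key=scores_by_region.get)
    best_region = max(scores_by_region, key=scores_by_region.get)
    values = list(scores_by_region.values())
    return {
        "worst_region": worst_region,
        "worst": scores_by_region[worst_region],
        "average": sum(values) / len(values),
        "gap": scores_by_region[best_region] - scores_by_region[worst_region],
    }
```

Running this once on Obj-only scores and once on BG-only scores would show whether the gap is larger for backgrounds, which is the pattern the study reports.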

Early Mitigations via New Prompt Template

To address regional disparities in generated images, the researchers explored using adjective descriptors in prompts, such as “European bag”, instead of noun-based descriptors like “bag in Europe”. This new prompting template improved background diversity by 52% for the worst-performing region and by 20% on average, with minimal impact on object realism and diversity. Adjective-based prompts produced more varied and neutral backgrounds, reducing stereotypical representations. The approach also led to a slight improvement in background realism for the worst-performing group and to overall improvements in object depiction.
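
The two prompt styles can be illustrated with a small sketch. The templates follow the article's examples (“bag in Europe” and “European bag”); the lookup table `REGION_ADJECTIVES` and the helpers `build_prompt` and `relative_improvement` are hypothetical names introduced here, and the article does not state whether its reported percentages are relative changes, so the second helper shows only one plausible reading.

```python
# Hypothetical region-to-adjective lookup; only regions named in the
# article are listed here.
REGION_ADJECTIVES = {"Africa": "African", "Europe": "European"}


def build_prompt(obj, region, style):
    """Build a prompt in the noun-based or adjective-based style."""
    if style == "noun":
        return f"{obj} in {region}"
    if style == "adjective":
        return f"{REGION_ADJECTIVES[region]} {obj}"
    raise ValueError(f"unknown style: {style!r}")


def relative_improvement(baseline, improved):
    """Percentage change of `improved` relative to `baseline`.

    Assumes improvements are reported as relative changes, which is an
    assumption of this sketch.
    """
    return (improved - baseline) / baseline * 100.0


# Print both prompt styles side by side for each region.
for region in REGION_ADJECTIVES:
    print(build_prompt("bag", region, "noun"), "|",
          build_prompt("bag", region, "adjective"))
```

Generating images from both prompt styles and scoring each set with the same BG-only coverage metric would allow a like-for-like comparison of background diversity.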

Conclusion

The researchers introduced Decomposed-DIG as a benchmarking tool to uncover geographic disparities in text-to-image models, focusing on object and background components. They showed that backgrounds exhibit larger regional disparities than objects, affecting both realism and diversity in generated images.

The authors identified specific model shortcomings, such as inadequate depiction of object diversity in Africa and unrealistic backgrounds in Europe. By proposing a new prompting strategy based on adjectives, the study demonstrated significant improvements in background diversity without compromising object realism. These findings underscore the importance of detailed evaluation metrics and targeted mitigations for making generative visual content more accurate and inclusive.

Written by

Soham Nandi

Soham Nandi is a technical writer based in Memari, India. His academic background is in Computer Science Engineering, specializing in Artificial Intelligence and Machine Learning. He has extensive experience in Data Analytics, Machine Learning, and Python. He has worked on group projects that required the implementation of Computer Vision, Image Classification, and App Development.

Citations

Please use one of the following formats to cite this article in your essay, paper or report:

  • APA

    Nandi, Soham. (2024, June 27). Novel DIG Metrics for Fairer Image Generation. AZoAi. Retrieved on October 08, 2024 from https://www.azoai.com/news/20240627/Novel-DIG-Metrics-for-Fairer-Image-Generation.aspx.

  • MLA

    Nandi, Soham. "Novel DIG Metrics for Fairer Image Generation". AZoAi. 08 October 2024. <https://www.azoai.com/news/20240627/Novel-DIG-Metrics-for-Fairer-Image-Generation.aspx>.

  • Chicago

    Nandi, Soham. "Novel DIG Metrics for Fairer Image Generation". AZoAi. https://www.azoai.com/news/20240627/Novel-DIG-Metrics-for-Fairer-Image-Generation.aspx. (accessed October 08, 2024).

  • Harvard

    Nandi, Soham. 2024. Novel DIG Metrics for Fairer Image Generation. AZoAi, viewed 08 October 2024, https://www.azoai.com/news/20240627/Novel-DIG-Metrics-for-Fairer-Image-Generation.aspx.
