Oct 19 2025
As AI image generation races ahead, this landmark survey delivers a long-needed roadmap for judging fidelity, alignment, and human perception in the next era of visual intelligence.

In the field of image generation, where models have achieved remarkable progress and are widely applied in medicine, fashion, and media, evaluation methods have failed to keep pace with model development, creating a critical disconnect. On one hand, research on evaluation is far less abundant than research on image generation itself, with a severe imbalance in the number of studies published on each over the past decade. On the other hand, image generation tasks are highly diverse (such as text-to-image, sketch-to-image, and layout-to-image), and each requires a distinct evaluation focus, yet there is no unified protocol to guide the selection of evaluation aspects.
Additionally, the two core evaluation approaches, human and automatic, face their own limitations. Human evaluation, though recognized as the "gold standard" for aligning with human perception, is time-consuming, costly, and prone to inconsistencies caused by individual differences among evaluators. Automatic evaluation, while efficient and objective, often fails to capture fine-grained human perceptual nuances (e.g., semantic alignment in complex scenes), leading to discrepancies with human judgments. Existing surveys also lack comprehensiveness: they focus solely on automatic evaluation, are limited to specific tasks, or provide only brief overviews, and thus fail to cover both human and automatic evaluation across diverse tasks.
A Collaborative and Comprehensive Survey
A research team composed of researchers from the School of Software Technology at Zhejiang University (Ningbo), the School of Computer Science and Engineering at Southeast University, and the College of Computer Science and Technology at Zhejiang University has conducted a study entitled "Image generation evaluation: a comprehensive survey of human and automatic evaluations".
Framework for Task-Specific Evaluation
This study presents a systematic and comprehensive survey of image generation evaluation, addressing the aforementioned gaps. First, it categorizes image generation into 10 distinct tasks based on input conditions (including image-to-image, sketch-to-image, text-to-image, and few-shot image generation), clarifying the unique evaluation aspects of each task. For example, text-to-image generation prioritizes semantic alignment between text and images, while layout-to-image generation focuses on fidelity to layout structures and object recognizability.
Second, the study proposes a novel evaluation protocol built around six common core evaluation aspects (fidelity, consistency, recognizability, diversity, overall quality, and user preference). For each task, the protocol specifies which aspects apply to human evaluation and which to automatic evaluation, giving researchers a unified reference.
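To illustrate how such a protocol might be encoded in practice, the following minimal Python sketch maps generation tasks to the aspects evaluated for them. The six aspects are taken from the survey, but the specific per-task assignments shown here are illustrative examples drawn from the descriptions above (e.g., layout-to-image emphasizing fidelity and recognizability), not the survey's exact tables.

```python
# Illustrative sketch only: one way to encode a task -> evaluation-aspect
# protocol. The six aspects come from the survey; the per-task assignments
# below are hypothetical examples, not the survey's exact tables.
from enum import Enum, auto

class Aspect(Enum):
    FIDELITY = auto()
    CONSISTENCY = auto()
    RECOGNIZABILITY = auto()
    DIVERSITY = auto()
    OVERALL_QUALITY = auto()
    USER_PREFERENCE = auto()

# Hypothetical mapping from generation task to the aspects evaluated for it.
PROTOCOL = {
    "text-to-image":   {Aspect.CONSISTENCY, Aspect.FIDELITY, Aspect.USER_PREFERENCE},
    "layout-to-image": {Aspect.FIDELITY, Aspect.RECOGNIZABILITY},
    "sketch-to-image": {Aspect.CONSISTENCY, Aspect.FIDELITY},
}

def aspects_for(task: str) -> set:
    """Return the evaluation aspects applicable to a given generation task."""
    return PROTOCOL.get(task, {Aspect.OVERALL_QUALITY})

print(aspects_for("text-to-image"))
```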
First In-Depth Review of Human Evaluation
Notably, this survey offers the first in-depth analysis of human evaluation in image generation, covering evaluation methods (absolute evaluation such as scale scoring, and comparative evaluation such as pairwise preference selection), tools (protocols, crowdsourcing platforms such as Amazon Mechanical Turk, and benchmarks), key experimental details (number of evaluators, their professional backgrounds, and experiment duration), and data analysis methods (inter-annotator agreement metrics such as Cronbach's alpha, and the mean opinion score, MOS).
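To make the data-analysis side concrete, the following minimal Python sketch computes the mean opinion score per image and Cronbach's alpha across evaluators from a simple ratings matrix. The (images x evaluators) layout and the 1-to-5 scale are assumptions chosen for illustration, not details prescribed by the survey.

```python
# Minimal sketch of the data-analysis step for human evaluation: MOS per
# image and Cronbach's alpha across evaluators. The (n_images, n_evaluators)
# layout and the 1-5 rating scale are illustrative assumptions.
import numpy as np

def mean_opinion_score(ratings: np.ndarray) -> np.ndarray:
    """MOS for each image: the average rating across all evaluators."""
    return ratings.mean(axis=1)

def cronbach_alpha(ratings: np.ndarray) -> float:
    """Cronbach's alpha, treating each evaluator as one 'item'."""
    k = ratings.shape[1]                              # number of evaluators
    item_var = ratings.var(axis=0, ddof=1).sum()      # sum of per-evaluator variances
    total_var = ratings.sum(axis=1).var(ddof=1)       # variance of per-image total scores
    return (k / (k - 1)) * (1.0 - item_var / total_var)

# Example: 4 generated images, each rated by 3 evaluators on a 1-5 scale.
ratings = np.array([[4, 5, 4],
                    [2, 3, 2],
                    [5, 5, 4],
                    [3, 3, 3]], dtype=float)
print(mean_opinion_score(ratings))   # MOS per image
print(cronbach_alpha(ratings))       # consistency across evaluators
```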
Automatic Metrics and Emerging Benchmarks
For automatic evaluation, it systematically reviews both classic metrics (e.g., pixel-level PSNR, distribution-level FID, perceptual-level LPIPS) and recent task-specific innovations (e.g., CLIP score for text-to-image alignment, SceneFID for multi-object scenes), as well as emerging evaluation benchmarks (e.g., PaintSkills for testing visual reasoning, TIFA for fine-grained semantic consistency via VQA).
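For readers who want to see the classic metrics spelled out, below is a minimal NumPy/SciPy sketch of pixel-level PSNR and of FID computed from pre-extracted feature vectors; real FID pipelines use Inception-v3 pool features, which are assumed to be supplied here. Perceptual and semantic metrics such as LPIPS and CLIP score depend on pretrained networks and are normally computed with their released implementations, so they are omitted from this sketch.

```python
# Minimal sketch of two classic automatic metrics: pixel-level PSNR and
# distribution-level FID computed from pre-extracted feature vectors
# (in practice, Inception-v3 pool features). Illustrative, not a reference
# implementation.
import numpy as np
from scipy import linalg

def psnr(reference: np.ndarray, generated: np.ndarray, max_val: float = 255.0) -> float:
    """PSNR in dB between two images of identical shape."""
    mse = np.mean((reference.astype(np.float64) - generated.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

def fid(real_feats: np.ndarray, fake_feats: np.ndarray) -> float:
    """FID between two feature sets (rows = samples), each fit with a Gaussian."""
    mu_r, mu_f = real_feats.mean(axis=0), fake_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_f = np.cov(fake_feats, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):
        covmean = covmean.real          # drop negligible imaginary parts
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))
```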
Finally, the study discusses current challenges (e.g., lack of universal benchmarks, inconsistencies between human and automatic evaluation) and future directions (e.g., developing domain-specific metrics, integrating psychometrics into human evaluation).
Paper Details and Authors
The paper "Image generation evaluation: a comprehensive survey of human and automatic evaluations" is authored by Qi LIU, Shuanglin YANG, Zejian LI, Lefan HOU, Chenye MENG, Ying ZHANG, and Lingyun SUN. The full text of the paper is available at https://link.springer.com/article/10.1631/FITEE.2400904.