Visual Prompt Learning Drives Next-Generation AI-Generated Content Across Modalities

From smarter image generation to efficient model tuning, visual prompt learning is reshaping AI-generated content. This review maps the landscape and shows where the field is headed next.

Review: Prompt learning in computer vision: a survey. Image Credit: TA design / Shutterstock

Researchers from Fudan University published a review article in the special issue "Latest Advances in Artificial Intelligence Generated Content" of Frontiers of Information Technology & Electronic Engineering (Vol. 25, No. 1, 2024). Building on the alignment between visual and linguistic information established by vision-language models (VLMs), prompt learning has become crucial in many application fields. The article provides a progressive and comprehensive review of visual prompt learning as it relates to artificial intelligence-generated content (AIGC).

The core of vision-language pre-training is to learn cross-modal alignment from large-scale image-text pairs. As a representative model, Contrastive Language-Image Pre-training (CLIP) trains image and text encoders with contrastive learning, enabling zero-shot/few-shot learning and image-text retrieval. A Large-scale ImaGe and Noisy-text embedding (ALIGN) enhances robustness by scaling up the training data, while Bootstrapping Language-Image Pre-training (BLIP) and the Vision-and-Language Transformer (ViLT) improve pre-training efficiency through bootstrapped caption generation and a lightweight architecture, respectively. These models provide robust feature encoders for prompt learning.
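
To make the zero-shot mechanism concrete, the sketch below classifies an image by comparing it against a handful of natural-language label prompts with a pre-trained CLIP model. It assumes the Hugging Face transformers library and the public openai/clip-vit-base-patch32 checkpoint; the image path and label set are illustrative placeholders, not part of the review.

```python
# Minimal sketch of CLIP-style zero-shot classification, assuming the
# Hugging Face "transformers" library and the public
# "openai/clip-vit-base-patch32" checkpoint.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # hypothetical input image
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Similarity logits between the image and each text prompt, turned into
# class probabilities without any task-specific training.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```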

Visual prompt learning is divided into two categories: language prompting and visual prompting. Language prompting enhances the model's generalization ability to new categories through learnable text contexts (such as CoOp and CoCoOp), while Prompt Learning with Optimal Transport (PLOT) achieves multi-prompt alignment through optimal transport to handle complex attributes. Visual prompting adapts pre-trained models through image perturbations or masks (such as VP and MAE-VQGAN), and Class-Aware Visual Prompt Tuning (CAVPT) and Iterative Label Mapping-based Visual Prompting (ILM-VP) further combine text and visual features to improve task performance. In addition, multi-modal prompting (such as MaPLe and Instruction-ViT) integrates language and visual information, showing stronger performance in cross-modal tasks.
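
As a rough illustration of language prompting in the CoOp style, the sketch below prepends a set of learnable context vectors to frozen class-name embeddings before they would be passed to a frozen text encoder. The module name, dimensions, and stand-in embeddings are simplifying assumptions rather than the original implementation.

```python
# Minimal sketch of CoOp-style language prompting: a small set of learnable
# context vectors is prepended to each class-name embedding sequence. Only
# these context vectors are trained; the pre-trained encoders stay frozen.
import torch
import torch.nn as nn

class LearnableContext(nn.Module):
    def __init__(self, n_ctx: int = 16, dim: int = 512):
        super().__init__()
        # Learnable context tokens shared across classes ("[V]_1 ... [V]_M").
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)

    def forward(self, class_embeddings: torch.Tensor) -> torch.Tensor:
        # class_embeddings: (num_classes, n_tokens, dim) frozen class-name embeddings
        n_cls = class_embeddings.size(0)
        ctx = self.ctx.unsqueeze(0).expand(n_cls, -1, -1)
        # Prepend the learnable context to each class-name token sequence.
        return torch.cat([ctx, class_embeddings], dim=1)

# Usage: only prompt_learner.ctx receives gradients during tuning.
prompt_learner = LearnableContext(n_ctx=16, dim=512)
class_embeddings = torch.randn(10, 4, 512)   # hypothetical frozen embeddings
prompts = prompt_learner(class_embeddings)   # (10, 20, 512) -> frozen text encoder
```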

Prompt-guided generative models are mainly diffusion models that combine VLMs to achieve semantically controllable image generation, editing, and inpainting. Stable Diffusion reduces computational costs by running diffusion in a latent space, and together with Imagen supports multi-modal prompts such as text and masks; ControlNet and DreamBooth enhance model flexibility through conditional control and few-shot fine-tuning, respectively. In image editing tasks, SmartBrush and Blended-Diff achieve high-precision inpainting using text and mask prompts, demonstrating the key role of prompts in interactive generation.
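
A minimal sketch of prompt-guided generation with a latent diffusion model is shown below, assuming the diffusers library and the public runwayml/stable-diffusion-v1-5 checkpoint; the prompt text, sampler settings, and output filename are illustrative choices, not taken from the review.

```python
# Minimal sketch of text-prompt-guided image generation, assuming the
# "diffusers" library, the "runwayml/stable-diffusion-v1-5" checkpoint,
# and a CUDA-capable GPU.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

# The text prompt steers the latent-space denoising process toward the
# described content; guidance_scale controls how strongly it is followed.
image = pipe(
    "a watercolor painting of a lighthouse at sunset",
    num_inference_steps=30,
    guidance_scale=7.5,
).images[0]
image.save("lighthouse.png")
```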

Prompt tuning aims to adapt large-scale models to downstream tasks efficiently. In vision Transformers (ViTs), Visual Prompt Tuning (VPT) achieves parameter-efficient fine-tuning by inserting learnable prompts into the input layer or intermediate layers, and Long-Tailed Prompt Tuning (LPT) optimizes prompt strategies for long-tailed classification. In VLMs, TCM and V-VL enhance cross-modal interaction through dual-modal prompt generators, while LoRA and the Adapter series reduce trainable parameters through low-rank adaptation or lightweight modules. These methods significantly cut computational costs while maintaining model performance, making them suitable for resource-constrained scenarios.
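
To illustrate the low-rank adaptation idea behind LoRA, the following sketch wraps a frozen linear layer with a trainable rank-r update so that only the small matrices A and B are tuned. The layer sizes, rank, and scaling are illustrative assumptions, not values from any specific paper.

```python
# Minimal sketch of LoRA-style low-rank adaptation: the pre-trained linear
# layer is frozen and augmented with a trainable low-rank residual.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)   # freeze pre-trained weights
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path plus low-rank residual: W x + scale * (B A) x
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

# Usage: wrap a pre-trained projection; only A and B are updated during tuning.
layer = LoRALinear(nn.Linear(768, 768))
out = layer(torch.randn(2, 16, 768))
```

Because B is initialized to zero, the wrapped layer starts out identical to the frozen model and only departs from it as the low-rank update is trained, which keeps the number of tuned parameters small.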

Future research directions include:

1) Alleviating domain shifts and improving prompt interpretability in image classification;
2) Optimizing models such as SAM by incorporating domain knowledge in semantic segmentation;
3) Reducing the distribution gap between base and novel classes in open-vocabulary object detection;
4) Building task-prompt associations in multi-task learning;
5) Introducing Chain-of-Thought (CoT) prompting to enhance multi-step reasoning;
6) Applying prompts in specialized domains such as medical imaging and weather forecasting;
7) Designing cross-view robust visual prompts for gait recognition.

These directions will push prompt learning towards more general and interpretable AI systems.

Journal reference:
Prompt learning in computer vision: a survey. Frontiers of Information Technology & Electronic Engineering, Vol. 25, No. 1, 2024.
