Modern Super-Resolution — From ESRGAN to Diffusion Models

From SRGAN to ESRGAN (2018)

SRGAN (Super-Resolution GAN) brought Generative Adversarial Networks (GANs) to super-resolution in 2016. By adding a perceptual loss, it produced visibly better results than earlier methods that optimized for PSNR alone. It still left room for improvement, and ESRGAN (Enhanced Super-Resolution GAN, Wang et al., 2018) refined it in four key areas.

The RRDB (Residual-in-Residual Dense Block) combines the advantages of residual and dense connections. Dense connections give each layer access to the features of all preceding layers in the block, which encourages feature reuse, while residual connections stabilize the training of deep networks. ESRGAN nests the two: dense connections operate inside each block at the micro level, and residual connections wrap the stacked dense blocks at the macro level, which gives the Residual-in-Residual structure its name.
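
The nesting described above can be sketched in a few lines of NumPy. This is a toy under stated assumptions, not ESRGAN's actual implementation: 1×1 channel-mixing maps stand in for the 3×3 convolutions, and the sizes `nf` and `gc`, the helper `init_block`, the five-layer block depth, and the residual scaling factor `beta=0.2` are illustrative choices.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def conv1x1(x, weight):
    # x: (C_in, H, W), weight: (C_out, C_in). A 1x1 convolution is a
    # per-pixel linear map over channels (stand-in for 3x3 convolutions).
    return np.einsum("oc,chw->ohw", weight, x)

def dense_block(x, weights, beta=0.2):
    # Each layer sees the block input plus every earlier layer's output.
    feats = [x]
    for i, w in enumerate(weights):
        out = conv1x1(np.concatenate(feats, axis=0), w)
        if i < len(weights) - 1:
            out = leaky_relu(out)
            feats.append(out)
    # Inner residual connection, with the branch scaled by beta.
    return x + beta * out

def rrdb(x, blocks, beta=0.2):
    out = x
    for weights in blocks:
        out = dense_block(out, weights, beta)
    # Outer residual connection around the stacked dense blocks.
    return x + beta * out

def init_block(rng, nf, gc, n_layers=5, scale=0.1):
    # Hypothetical helper: layer i takes nf + i * gc input channels.
    weights = []
    for i in range(n_layers):
        c_in = nf + i * gc
        c_out = gc if i < n_layers - 1 else nf
        weights.append(scale * rng.standard_normal((c_out, c_in)))
    return weights

rng = np.random.default_rng(0)
nf, gc = 16, 8
blocks = [init_block(rng, nf, gc) for _ in range(3)]
x = rng.standard_normal((nf, 12, 12))
y = rrdb(x, blocks)
print(y.shape)
```

The input and output shapes match, so blocks of this kind can be stacked to any depth.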

Another innovation in ESRGAN is the RaGAN (Relativistic average GAN) discriminator. A standard GAN discriminator judges in isolation whether a single image is real or generated. A relativistic average discriminator instead estimates how much more realistic a real image is than the average generated image, and vice versa. Because both real and generated images enter the generator's loss, this relative judgment provides richer gradient signals.
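
As a rough sketch of the relativistic average formulation, the losses can be written directly on raw discriminator outputs. The function names and the `eps` stabilizer are illustrative assumptions; `c_real` and `c_fake` stand for the discriminator's pre-sigmoid scores on a batch of real and generated images.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relativistic_d_loss(c_real, c_fake, eps=1e-12):
    # How much more realistic is each real image than the average fake?
    d_real = sigmoid(c_real - c_fake.mean())
    # How much more realistic is each fake image than the average real?
    d_fake = sigmoid(c_fake - c_real.mean())
    return -np.mean(np.log(d_real + eps)) - np.mean(np.log(1.0 - d_fake + eps))

def relativistic_g_loss(c_real, c_fake, eps=1e-12):
    # Symmetric form: both the real and the fake term reach the generator.
    d_real = sigmoid(c_real - c_fake.mean())
    d_fake = sigmoid(c_fake - c_real.mean())
    return -np.mean(np.log(1.0 - d_real + eps)) - np.mean(np.log(d_fake + eps))

c_real = np.array([3.0, 4.0, 5.0])
c_fake = np.array([-3.0, -4.0, -5.0])
d_loss = relativistic_d_loss(c_real, c_fake)
g_loss = relativistic_g_loss(c_real, c_fake)
```

When the discriminator separates the two batches well, as in this example, `d_loss` is small and `g_loss` is large.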

The improvement to the perceptual loss is also crucial. SRGAN measures feature distance after the VGG activation layers, but ReLU makes those features sparse: many positions are zero, so the loss provides only weak supervision. ESRGAN measures feature distance before activation, which preserves more information.
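
The effect of activation sparsity can be illustrated with synthetic data. In the sketch below, random Gaussian arrays stand in for pre-activation VGG feature maps (an assumption; real VGG features are distributed differently and are often much sparser), and an L1 distance is used as the feature loss.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for pre-activation features of an SR output and its ground truth.
feat_sr = rng.standard_normal((64, 32, 32))
feat_hr = feat_sr + 0.1 * rng.standard_normal((64, 32, 32))

def relu(x):
    return np.maximum(x, 0.0)

# Fraction of positions that ReLU sets to zero.
post_sparsity = np.mean(relu(feat_hr) == 0.0)
# Positions that are zero in both maps contribute nothing to a post-activation loss.
both_zero = np.mean((relu(feat_sr) == 0.0) & (relu(feat_hr) == 0.0))

loss_pre = np.mean(np.abs(feat_sr - feat_hr))
loss_post = np.mean(np.abs(relu(feat_sr) - relu(feat_hr)))
print(post_sparsity, both_zero, loss_pre, loss_post)
```

Roughly half of the positions carry no signal after ReLU in this toy setup, and the post-activation loss is correspondingly smaller than the pre-activation one.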

Finally, ESRGAN removes Batch Normalization. BN is effective in classification tasks but introduces artifacts in image restoration because BN normalizes feature distributions, potentially destroying local statistical properties of images. After removing BN, the network generalizes better and generates more natural images.

Figure 1 - Evolution timeline of super-resolution technology (from SRCNN to diffusion models):

mermaid
flowchart TD
    A["SRCNN (2014)<br/>First CNN-based SR<br/>End-to-end learning"] --> B["SRGAN (2016)<br/>GAN + Perceptual Loss<br/>Visual quality breakthrough"]
    B --> C["ESRGAN (2018)<br/>RRDB + RaGAN<br/>Remove Batch Norm"]
    C --> D["Real-ESRGAN (2021)<br/>Real-world degradation<br/>High-order degradation chain"]
    D --> E["SwinIR (2021)<br/>Swin Transformer<br/>Long-range dependency"]
    E --> F["Diffusion Models (2021+)<br/>SR3 / LDM<br/>Ultimate quality"]

    classDef milestone fill:#2196F3,color:#fff
    class A,B,C,D,E,F milestone

Real-ESRGAN—Challenges of Real-World Images

Training super-resolution models typically requires paired low-resolution and high-resolution images, but real-world low-resolution images result from complex degradation processes: lens blur, sensor noise, compression artifacts, and repeated rescaling. The core idea of Real-ESRGAN (Wang et al., 2021) is a high-order degradation model that synthesizes such inputs.

Traditional degradation models apply a single, fixed degradation pass, often just blur followed by downsampling. Real-ESRGAN starts from a fuller single pass that also includes noise and JPEG compression:

$$ I_{deg} = [(I_{HR} \otimes k) \downarrow_r + n]_{JPEG} $$

This pass includes blur (convolution with a kernel $k$ that simulates lens and motion blur), downsampling ($\downarrow_r$ scales the image down to $1/r$ of its size), additive noise $n$, and JPEG compression. The high-order model then applies this pass repeatedly with different parameters (Real-ESRGAN uses a second-order process), because a real image may be compressed and rescaled several times, which produces complex compound artifacts.
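
A minimal NumPy sketch of such a repeated degradation pass is shown below. It is a simplification under several assumptions: an isotropic Gaussian kernel, plain decimation instead of a resize operation, Gaussian noise only, and uniform quantization as a crude stand-in for JPEG compression. The function names and all parameter values are illustrative.

```python
import numpy as np

def gaussian_kernel(size=5, sigma=1.0):
    ax = np.arange(size) - (size - 1) / 2.0
    k = np.exp(-(ax[:, None] ** 2 + ax[None, :] ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def blur(img, kernel):
    # Direct 2D filtering with edge padding. The kernel is symmetric,
    # so correlation and convolution coincide.
    kh, kw = kernel.shape
    padded = np.pad(img, ((kh // 2, kh // 2), (kw // 2, kw // 2)), mode="edge")
    out = np.zeros_like(img, dtype=np.float64)
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * padded[i:i + img.shape[0], j:j + img.shape[1]]
    return out

def degrade_once(img, r, sigma_blur, sigma_noise, q_step, rng):
    x = blur(img, gaussian_kernel(5, sigma_blur))
    x = x[::r, ::r]                                   # downsample by r
    x = x + sigma_noise * rng.standard_normal(x.shape)
    x = np.round(x / q_step) * q_step                 # stand-in for JPEG
    return np.clip(x, 0.0, 1.0)

def degrade_high_order(img, stages, rng):
    # Apply the single pass once per stage, each with its own parameters.
    for params in stages:
        img = degrade_once(img, rng=rng, **params)
    return img

rng = np.random.default_rng(0)
hr = rng.random((64, 64))
stages = [
    dict(r=2, sigma_blur=1.0, sigma_noise=0.02, q_step=1 / 64),
    dict(r=2, sigma_blur=0.6, sigma_noise=0.01, q_step=1 / 32),
]
lr = degrade_high_order(hr, stages, rng)
print(lr.shape)  # (16, 16)
```

Each stage draws its own parameters, so two stages already produce artifacts (quantized noise that is blurred and requantized) that a single pass cannot.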

Real-ESRGAN also uses sinc filters to simulate real-world degradation artifacts like ringing and overshoot. These artifacts are hard to capture in traditional degradation models but are common in real low-quality images.

The discriminator is also switched to a U-Net structure with spectral normalization to stabilize training. U-Net’s skip connections allow the discriminator to see both local and global features, providing more fine-grained pixel-level feedback. The generator inherits ESRGAN’s architecture but is trained on synthetic pairs produced by the high-order degradation model rather than on simple bicubic downsampling.

Figure 2 - Real-ESRGAN’s high-order degradation chain (blur → downsample → noise → compression):

mermaid
flowchart TD
    A["High-resolution Image<br/>HR Image"] --> B["Blur<br/>Convolve with kernel k"]
    B --> C["Downsample<br/>Scale to 1/r"]
    C --> D["Add Noise<br/>n"]
    D --> E["JPEG Compression"]
    E --> F["Low-resolution Image<br/>LR Image"]

    classDef img fill:#9C27B0,color:#fff
    classDef op fill:#2196F3,color:#fff
    class A,F img
    class B,C,D,E op

SwinIR—Transformer’s Long-Range Dependencies

Transformers have excelled in natural language processing, but applying them directly to image restoration runs into computational cost. Standard self-attention has complexity $O(N^2)$, where $N$ is the number of image pixels. A 256×256 image has $N = 65{,}536$, so attention over all pixels would need about 4.3 billion pairwise scores per layer. SwinIR (2021, Liang et al.) addresses this with shifted window attention.

Swin Transformer’s core idea is to divide the image into non-overlapping local windows and compute self-attention only within each window, which makes the cost linear in $N$ for a fixed window size. Swin Transformer also shifts the window positions in alternating layers, so information crosses window boundaries and the effective receptive field grows with depth.
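
The window mechanism can be sketched with NumPy reshapes. The sketch omits things a real Swin layer has: learned query/key/value projections, multiple heads, relative position bias, and the attention mask that keeps wrapped-around regions apart after the cyclic shift. The function names and the sizes `h`, `w`, `m`, `c` are illustrative.

```python
import numpy as np

def window_partition(x, m):
    # x: (H, W, C) with H and W divisible by m -> (num_windows, m*m, C)
    h, w, c = x.shape
    x = x.reshape(h // m, m, w // m, m, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, m * m, c)

def window_attention(x, m):
    # Plain softmax self-attention inside each window.
    win = window_partition(x, m)
    scores = win @ win.transpose(0, 2, 1) / np.sqrt(x.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ win, attn

h = w = 16
m, c = 4, 8
x = np.random.default_rng(0).standard_normal((h, w, c))
out, attn = window_attention(x, m)

# Shifted layer: cyclically roll the map by m // 2 before partitioning,
# so the new windows straddle the previous window boundaries.
shifted = np.roll(x, shift=(-(m // 2), -(m // 2)), axis=(0, 1))
out_shifted, _ = window_attention(shifted, m)

n = h * w
global_pairs = n * n        # full attention: 65536 score entries
window_pairs = n * m * m    # windowed attention: 4096 score entries
print(attn.shape, global_pairs, window_pairs)
```

For this 16×16 map with 4×4 windows, windowed attention computes 16 times fewer scores than global attention, and the gap widens as the image grows.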

SwinIR’s network architecture consists of three parts: shallow feature extraction uses 3×3 convolutions to extract initial features; deep feature extraction uses multiple Residual Swin Transformer Blocks (RSTB), each containing several Swin Transformer layers and convolutional layers; high-quality image reconstruction uses sub-pixel convolution and convolutional layers to generate the final image.
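
The sub-pixel convolution in the reconstruction stage ends with a rearrangement step, often called pixel shuffle, that trades channels for spatial resolution. A minimal NumPy sketch of that rearrangement follows; the convolution that precedes it is omitted, and the channel ordering is an assumption chosen to match the common convention.

```python
import numpy as np

def pixel_shuffle(x, r):
    # x: (C * r * r, H, W) -> (C, H * r, W * r)
    c_r2, h, w = x.shape
    c = c_r2 // (r * r)
    x = x.reshape(c, r, r, h, w)
    x = x.transpose(0, 3, 1, 4, 2)      # (C, H, r, W, r)
    return x.reshape(c, h * r, w * r)

x = np.arange(4 * 2 * 2, dtype=np.float64).reshape(4, 2, 2)
y = pixel_shuffle(x, 2)
print(y.shape)  # (1, 4, 4)
```

Each group of r² input channels supplies the r×r sub-pixels of one output location, so upscaling is learned by the preceding convolution rather than fixed by an interpolation kernel.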

The Transformer’s advantage lies in long-range modeling. Convolutional networks have limited receptive fields and struggle to capture long-range dependencies. Stacked shifted-window attention lets SwinIR relate distant pixels, which helps recover texture details and structural information. At publication, SwinIR reported state-of-the-art results on super-resolution, image denoising, and JPEG compression artifact reduction.

Diffusion Models—From Noise to Clear Images

Diffusion models emerged between 2020 and 2022 as a breakthrough in image generation and restoration. They are built on two processes: a forward process and a reverse process.

The forward process gradually adds Gaussian noise to clear images until the image becomes pure noise. Each step of noise addition satisfies:

$$ q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t} x_{t-1}, \beta_t I) $$

where $\beta_t$ is the noise schedule parameter controlling how much noise is added per step. After $T$ steps, $x_T$ approximately follows a standard normal distribution.
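
Because every step is Gaussian, the forward process has a closed form: with $\bar{\alpha}_t = \prod_{s=1}^{t} (1 - \beta_s)$, a sample at step $t$ is $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$. The sketch below uses this to jump straight to any step. The linear schedule from 1e-4 to 0.02 over 1000 steps is an assumed, commonly used choice, and `q_sample` is an illustrative name.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)      # assumed linear noise schedule
alphas_bar = np.cumprod(1.0 - betas)    # cumulative product of (1 - beta_s)

def q_sample(x0, t, rng):
    # Sample x_t directly from x_0 (t is 1-indexed).
    eps = rng.standard_normal(x0.shape)
    a = alphas_bar[t - 1]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps, eps

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(32, 32))
x_mid, _ = q_sample(x0, 250, rng)
x_T, _ = q_sample(x0, T, rng)

# Weight of the original image that survives at the final step.
signal_left = np.sqrt(alphas_bar[-1])
print(signal_left, x_T.std())
```

Under this schedule less than 1% of the original signal amplitude remains at step $T$, which is why $x_T$ can be treated as pure noise.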

The reverse process trains neural networks to gradually recover clear images from noise. The network learns to predict the noise that should be removed at each step:

$$ p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t)) $$

In super-resolution, diffusion models input low-resolution images as conditional information to the denoising network. At each iteration, the network references both the low-resolution image’s guidance and the current noisy image state, progressively generating high-resolution details.
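
The conditional sampling loop can be sketched as follows, assuming the common choice of fixing the reverse variance to $\beta_t$ and parameterizing the mean through a noise prediction. The callable `predict_noise(x_t, lr_up, t)` is a hypothetical stand-in for the trained denoising network. To keep the sketch runnable without a trained model, it is replaced by an `oracle` that cheats by knowing the ground-truth image, so the example demonstrates only the update rule, not actual super-resolution.

```python
import numpy as np

T = 50
betas = np.linspace(1e-4, 0.2, T)       # short illustrative schedule
alphas = 1.0 - betas
alphas_bar = np.cumprod(alphas)

def sample_sr(predict_noise, lr_up, rng):
    x = rng.standard_normal(lr_up.shape)            # start from noise x_T
    for t in range(T - 1, -1, -1):
        eps_hat = predict_noise(x, lr_up, t)        # conditioned on the LR image
        mean = (x - betas[t] / np.sqrt(1.0 - alphas_bar[t]) * eps_hat) / np.sqrt(alphas[t])
        # No noise is added on the final step.
        noise = rng.standard_normal(x.shape) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise
    return x

rng = np.random.default_rng(0)
hr = rng.uniform(-1.0, 1.0, size=(16, 16))
# Nearest-neighbor upsampled LR image, the conditioning input.
lr_up = np.repeat(np.repeat(hr[::2, ::2], 2, axis=0), 2, axis=1)

def oracle(x_t, lr_up, t):
    # Toy predictor that knows hr; a real network only sees x_t, lr_up, and t.
    return (x_t - np.sqrt(alphas_bar[t]) * hr) / np.sqrt(1.0 - alphas_bar[t])

sr = sample_sr(oracle, lr_up, rng)
print(np.abs(sr - hr).max())
```

With a perfect noise prediction the loop lands on the clean image. A trained network only approximates that prediction, which is why the low-resolution conditioning matters at every step.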

The advantage of diffusion models is that their results are highly realistic and contain plausible synthesized texture detail. The cost is slow inference: denoising takes many iterations (usually 50-1000 steps). Representative works include SR3 (Image Super-Resolution via Iterative Refinement) and LDM (Latent Diffusion Model). LDM runs the diffusion process in a compressed latent space rather than in pixel space, which substantially reduces inference cost.

Figure 3 - Forward and reverse processes of diffusion models (noise addition and iterative denoising):

mermaid
flowchart TD
    A["Clear Image<br/>x₀"] --> B["Add Noise<br/>x₁ → x₂ → ... → x_T"]
    B --> C["Pure Noise<br/>x_T ~ N(0,1)"]
    C --> D["Denoising Network<br/>Predict Noise"]
    D --> E["Remove Noise<br/>x_{T-1} → ... → x₁ → x₀"]
    E --> F["Recovered Clear Image<br/>x₀̂"]

    classDef img fill:#4CAF50,color:#fff
    classDef process fill:#2196F3,color:#fff
    classDef noise fill:#FF9800,color:#fff
    class A,F img
    class B,E process
    class C,D noise

Method Comparison and Selection Guide

Traditional Methods vs Deep Learning Methods

| Dimension | Traditional Methods | Deep Learning Methods |
|---|---|---|
| Compute resource needs | Low, can run on embedded devices | High, typically requires GPU |
| Inference speed | Fast, real-time processing | Slower, diffusion models especially slow |
| Generalization ability | Relies on degradation model assumptions, limited generalization | Data-driven, strong generalization |
| Texture details | Hard to generate realistic textures | Can synthesize natural textures |
| Interpretability | Strong, clear mathematical principles | Weak, black-box models |
| Training needs | No training required | Requires large data and compute |
| Suitable scenarios | Known degradation model, resource-constrained | Complex real-world scenarios, quality-priority |

Method-Specific Use Cases

| Method | Best For | Not Recommended |
|---|---|---|
| Bicubic interpolation | Real-time preview, low quality requirements | Large scaling (>4x) |
| Wiener filter | Deblurring with known PSF, astronomical images | Blind deblurring, unknown noise |
| Bilateral filter | Light denoising, edge preservation | Strong noise, texture regions |
| SRCNN | Fast super-resolution, mobile devices | Texture-rich natural images |
| SRGAN/ESRGAN | Photo enhancement, visual quality priority | Scenes requiring strict fidelity |
| Real-ESRGAN | Real-world low-quality image restoration | Artistic stylization |
| Diffusion models | Ultimate quality, detail synthesis | Real-time applications, resource-constrained |

Evaluation Metrics

Super-resolution evaluation metrics fall into two categories: pixel-level metrics measure reconstruction accuracy, while perceptual-level metrics measure visual quality.

PSNR (Peak Signal-to-Noise Ratio) is the most commonly used pixel-level metric, calculating the mean squared error (MSE) between reconstructed and ground truth images:

$$ \text{PSNR} = 10 \cdot \log_{10}\left(\frac{\text{MAX}_I^2}{\text{MSE}}\right) $$

where $\text{MAX}_I$ is the maximum pixel value (255 for 8-bit images). Higher PSNR is better, but it only focuses on pixel differences without considering human perception characteristics.
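
The formula translates directly into a few lines of NumPy. The function name `psnr` and the 8×8 test images are illustrative. An image that is off by exactly one gray level everywhere has MSE = 1, which gives a convenient check.

```python
import numpy as np

def psnr(reference, test, max_value=255.0):
    # Convert to float first so uint8 subtraction cannot wrap around.
    ref = np.asarray(reference, dtype=np.float64)
    tst = np.asarray(test, dtype=np.float64)
    mse = np.mean((ref - tst) ** 2)
    if mse == 0:
        return float("inf")   # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)

ref = np.full((8, 8), 100, dtype=np.uint8)
off_by_one = ref + 1
print(round(psnr(ref, off_by_one), 2))  # 48.13
```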

SSIM (Structural Similarity Index) compares two images along three dimensions: luminance, contrast, and structure. SSIM ranges from -1 to 1, with 1 meaning the two images are identical. SSIM agrees with human perception better than PSNR does, but it still tends to favor overly smooth results.
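
A simplified sketch of the SSIM formula is shown below. It computes a single global score from whole-image statistics, whereas the standard metric averages the same expression over local windows, so the numbers will not match library implementations. The stabilizing constants use the conventional values $C_1 = (0.01 L)^2$ and $C_2 = (0.03 L)^2$, where $L$ is the dynamic range. The name `ssim_global` is illustrative.

```python
import numpy as np

def ssim_global(x, y, max_value=255.0):
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    c1 = (0.01 * max_value) ** 2
    c2 = (0.03 * max_value) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = np.mean((x - mu_x) * (y - mu_y))
    # Luminance term times combined contrast/structure term.
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    )

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(32, 32)).astype(np.float64)
same = ssim_global(img, img)
inverted = ssim_global(img, 255.0 - img)
print(same, inverted)
```

An image compared with itself scores 1, while its inverted copy scores below zero because the structure term turns negative.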

LPIPS (Learned Perceptual Image Patch Similarity) is a perceptual-level metric that extracts features using pre-trained deep networks (such as VGG or AlexNet) and calculates distances in feature space. Lower LPIPS indicates better perceptual quality and better reflects human evaluation of texture and details.

FID (Fréchet Inception Distance) is commonly used to evaluate generation quality, calculating the Fréchet distance between real image distribution and generated image distribution in Inception network feature space. Lower FID indicates the generated distribution is closer to the real distribution.
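
The distance itself, between two Gaussians fitted to feature sets, is $\lVert \mu_a - \mu_b \rVert^2 + \mathrm{Tr}\left(\Sigma_a + \Sigma_b - 2(\Sigma_a \Sigma_b)^{1/2}\right)$. The sketch below computes it on arbitrary feature arrays. Random vectors stand in for Inception features (an assumption; a real FID needs the pre-trained network and many images), and the trace of the matrix square root is obtained from eigenvalues instead of an explicit matrix square root.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    # feats_*: (num_samples, feature_dim)
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^(1/2)) is the sum of square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative
    # up to rounding for covariance matrices.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sum(np.sqrt(np.clip(eigvals.real, 0.0, None)))
    mean_term = np.sum((mu_a - mu_b) ** 2)
    return float(mean_term + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)

rng = np.random.default_rng(0)
feats = rng.standard_normal((500, 4))
d_same = frechet_distance(feats, feats)
d_shift = frechet_distance(feats, feats + 3.0)
print(d_same, d_shift)
```

Identical feature sets give a distance of about zero. Shifting every feature by 3 in four dimensions leaves the covariance unchanged and adds only the mean term, 4 × 3² = 36.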

In practical applications, metrics should be selected based on the scenario. If pursuing pixel-level fidelity, prioritize PSNR and SSIM. If pursuing visual quality, prioritize LPIPS and FID. Diffusion models typically perform best on perceptual metrics but may have lower PSNR than other methods.

Summary

From SRGAN to diffusion models, super-resolution technology has evolved from perceptual loss to adversarial training, from real-world degradation modeling to Transformer global modeling, and from single-step reconstruction to iterative denoising. Each method has its advantages and suitable scenarios:

  • ESRGAN provides a good balance between quality and speed
  • Real-ESRGAN specializes in real-world low-quality image restoration
  • SwinIR achieves powerful global modeling with Transformers
  • Diffusion models achieve ultimate visual quality but at high inference cost

When selecting methods, you need to balance quality, speed, computational resources, and application scenarios. For real-time applications, you can choose SRCNN or ESRGAN. For photo enhancement, Real-ESRGAN is a good choice. For scenarios requiring ultimate quality, diffusion models are currently the best option.