Modern Super-Resolution — From ESRGAN to Diffusion Models
From SRGAN to ESRGAN (2018)
SRGAN (Super-Resolution GAN) brought Generative Adversarial Networks (GANs) to super-resolution in 2016. By training with a perceptual loss instead of optimizing PSNR alone, it produced visibly sharper results than earlier PSNR-oriented methods. It still left room for improvement, and ESRGAN (Enhanced Super-Resolution GAN, Wang et al., 2018) improved on it in four directions: the generator's basic block, the discriminator, the perceptual loss, and the removal of Batch Normalization.
The RRDB (Residual-in-Residual Dense Block) combines the advantages of residual and dense connections. Dense connections give each layer access to the features of all preceding layers, which encourages feature reuse, while residual connections stabilize the training of deep networks. Each RRDB chains several dense blocks and wraps them in an outer skip connection, forming a Residual-in-Residual structure: residual connections at the macro level around the blocks, and dense connections at the micro level within each block.
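The wiring described above can be sketched in a few lines of NumPy. This is a toy illustration, not the ESRGAN implementation: 1×1 channel-mixing matrices stand in for the 3×3 convolutions, and the helper names (`dense_block`, `rrdb`, `make_dense_weights`) are hypothetical. The residual scaling factor of 0.2 and the five layers per dense block follow the values commonly used with ESRGAN, but treat them as assumptions here.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def conv1x1(x, weight):
    # x: (C_in, H, W), weight: (C_out, C_in) -> (C_out, H, W)
    # Stand-in for a 3x3 convolution: mixes channels at each pixel only.
    return np.einsum("oc,chw->ohw", weight, x)

def dense_block(x, weights, beta=0.2):
    feats = [x]  # every layer sees the input plus all earlier outputs
    for i, w in enumerate(weights):
        out = conv1x1(np.concatenate(feats, axis=0), w)
        if i < len(weights) - 1:
            out = leaky_relu(out)
            feats.append(out)
    return x + beta * out  # inner (micro-level) residual connection

def rrdb(x, blocks, beta=0.2):
    out = x
    for weights in blocks:
        out = dense_block(out, weights, beta)
    return x + beta * out  # outer (macro-level) residual connection

def make_dense_weights(rng, channels=8, growth=4, layers=5, scale=0.1):
    weights = []
    for i in range(layers):
        c_in = channels + i * growth  # input grows with each dense layer
        c_out = growth if i < layers - 1 else channels
        weights.append(rng.normal(0.0, scale, (c_out, c_in)))
    return weights

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 6, 6))
blocks = [make_dense_weights(rng) for _ in range(3)]
y = rrdb(x, blocks)  # same shape as x
```

Because both residual paths add a scaled branch to the identity, the output always has the same shape as the input, which is what lets many such blocks be stacked.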
Another innovation in ESRGAN is the RaGAN (Relativistic average GAN) discriminator. A standard GAN discriminator judges whether a single image is real or generated. A relativistic discriminator instead estimates how much more realistic a real image is than generated images on average, and vice versa. This relative judgment provides richer gradient signals, because the generator loss depends on both real and generated samples.
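A minimal sketch of the relativistic average losses, assuming the discriminator has already produced raw scores (logits) for a batch of real and generated images. The function names are hypothetical, and the small `eps` term is only there for numerical safety.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ragan_discriminator_loss(real_logits, fake_logits, eps=1e-12):
    # D_Ra(real, fake): is a real image more realistic than the average fake?
    d_real = sigmoid(real_logits - fake_logits.mean())
    # D_Ra(fake, real): is a fake image more realistic than the average real?
    d_fake = sigmoid(fake_logits - real_logits.mean())
    return float(-(np.log(d_real + eps).mean() + np.log(1.0 - d_fake + eps).mean()))

def ragan_generator_loss(real_logits, fake_logits, eps=1e-12):
    # Symmetric counterpart: the generator wants fakes to look more
    # realistic than reals, so both terms carry gradient.
    d_real = sigmoid(real_logits - fake_logits.mean())
    d_fake = sigmoid(fake_logits - real_logits.mean())
    return float(-(np.log(1.0 - d_real + eps).mean() + np.log(d_fake + eps).mean()))

real = np.array([5.0, 4.0, 6.0])
fake = np.array([-5.0, -4.0, -6.0])
d_loss = ragan_discriminator_loss(real, fake)  # small: D separates the two sets
g_loss = ragan_generator_loss(real, fake)      # large: G is losing
```

When the discriminator cannot tell the two sets apart (equal scores), both relative probabilities are 0.5 and each loss equals 2·ln 2.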
The improvement to the perceptual loss is also crucial. SRGAN measures feature distance after the VGG activation function, but ReLU makes the activated features sparse: many positions are zero, so the loss provides only weak supervision at those positions. ESRGAN measures feature distance before activation, which preserves more of the original information.
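The sparsity argument can be checked numerically. The sketch below uses random arrays as stand-ins for VGG feature maps (an assumption made purely for illustration; no real network is involved) and compares an L1 feature distance computed before and after ReLU.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for pre-activation feature maps of the SR output and the HR target.
pre_sr = rng.normal(size=(64, 16, 16))
pre_hr = pre_sr + 0.1 * rng.normal(size=pre_sr.shape)

# Post-activation features, as used by the SRGAN-style perceptual loss.
post_sr = np.maximum(pre_sr, 0.0)
post_hr = np.maximum(pre_hr, 0.0)

# Fraction of positions zeroed by ReLU in the SR features.
zero_fraction = float(np.mean(post_sr == 0.0))
# Positions where both feature maps are zero contribute nothing to the loss.
both_zero = float(np.mean((post_sr == 0.0) & (post_hr == 0.0)))

loss_pre = float(np.mean(np.abs(pre_sr - pre_hr)))     # ESRGAN-style distance
loss_post = float(np.mean(np.abs(post_sr - post_hr)))  # SRGAN-style distance
```

With zero-mean inputs, roughly half of the positions are silenced after ReLU, and the post-activation distance is never larger than the pre-activation one, so the pre-activation loss carries signal at more positions.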
Finally, ESRGAN removes Batch Normalization (BN). BN works well in classification, but in image restoration it can introduce artifacts: it normalizes features with batch statistics during training and with running estimates at test time, so when the two differ, the local statistics of the image are distorted. Removing BN improves generalization, produces more natural images, and reduces computation and memory use.
Figure 1 - Evolution timeline of super-resolution technology (from SRCNN to diffusion models):
flowchart TD
A["SRCNN (2014)<br/>First CNN-based SR<br/>End-to-end learning"] --> B["SRGAN (2016)<br/>GAN + Perceptual Loss<br/>Visual quality breakthrough"]
B --> C["ESRGAN (2018)<br/>RRDB + RaGAN<br/>Remove Batch Norm"]
C --> D["Real-ESRGAN (2021)<br/>Real-world degradation<br/>High-order degradation chain"]
D --> E["SwinIR (2021)<br/>Swin Transformer<br/>Long-range dependency"]
E --> F["Diffusion Models (2021+)<br/>SR3 / LDM<br/>Ultimate quality"]
classDef milestone fill:#2196F3,color:#fff
class A,B,C,D,E,F milestone
Real-ESRGAN—Challenges of Real-World Images
Super-resolution models are usually trained on paired low-resolution and high-resolution images, but real-world low-resolution images come from complex degradation processes: lens blur, sensor noise, compression artifacts, and repeated rescaling. The core idea of Real-ESRGAN (Wang et al., 2021) is a high-order degradation model that synthesizes such inputs for training.
Traditional degradation models assume images are first blurred then downsampled. Real-ESRGAN’s degradation chain is more complex:
$$ I_{deg} = [(I_{HR} \otimes k) \downarrow_r + n]_{JPEG} $$
This process includes blur (convolution with a kernel $k$ that simulates lens and motion blur), downsampling ($\downarrow_r$ scales the image down to $1/r$ of its size), additive noise $n$, and JPEG compression. More importantly, the process can be applied repeatedly, because a real image may be compressed and rescaled several times before it reaches the model. The Real-ESRGAN paper uses a second-order process, that is, two passes through this chain.
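The chain above can be sketched with NumPy alone. This is a simplified illustration under several assumptions: a Gaussian kernel stands in for $k$, plain decimation stands in for the resize step, and coarse uniform quantization is used as a crude placeholder for JPEG compression (real JPEG needs an image codec). All function names and default parameters are hypothetical.

```python
import numpy as np

def gaussian_kernel_1d(sigma):
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    k = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def blur(img, sigma):
    # Separable Gaussian blur with reflect padding (the "⊗ k" step).
    k = gaussian_kernel_1d(sigma)
    r = len(k) // 2
    padded = np.pad(img, r, mode="reflect")
    rows = np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 0, rows)

def degrade_once(img, rng, sigma_blur=1.5, scale=2, sigma_noise=0.02, levels=32):
    out = blur(img, sigma_blur)
    out = out[::scale, ::scale]                          # downsample by decimation
    out = out + rng.normal(0.0, sigma_noise, out.shape)  # additive Gaussian noise
    out = np.clip(out, 0.0, 1.0)
    # Placeholder for JPEG: coarse quantization, NOT real JPEG compression.
    return np.round(out * (levels - 1)) / (levels - 1)

def high_order_degrade(img, rng, order=2, **kwargs):
    out = img
    for _ in range(order):  # order=2 applies the whole chain twice
        out = degrade_once(out, rng, **kwargs)
    return out

rng = np.random.default_rng(0)
hr = rng.random((64, 64))          # grayscale image with values in [0, 1]
lr = high_order_degrade(hr, rng)   # 64x64 -> 32x32 -> 16x16
```

In a training pipeline, the parameters of each stage would be sampled randomly per image so that the model sees a wide range of degradations; the fixed defaults here only keep the example short.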
Real-ESRGAN also uses sinc filters to simulate real-world degradation artifacts like ringing and overshoot. These artifacts are hard to capture in traditional degradation models but are common in real low-quality images.
The discriminator is also switched to a U-Net structure with spectral normalization to stabilize training. The U-Net's skip connections let the discriminator combine local and global features and output a realness score for each pixel, which gives the generator more fine-grained feedback. The generator inherits ESRGAN's architecture but is retrained on synthetic training pairs produced by the high-order degradation pipeline rather than on captured real-world pairs.
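Spectral normalization divides a weight tensor by an estimate of its largest singular value, which bounds how strongly the layer can amplify its input. The sketch below estimates that value with power iteration; the function name and the number of iterations are assumptions made for this illustration (deep learning frameworks typically run a single iteration per training step and reuse the vector between steps).

```python
import numpy as np

def spectral_normalize(weight, n_iters=50, seed=0):
    """Return (weight / sigma, sigma), with sigma estimated by power iteration."""
    rng = np.random.default_rng(seed)
    w = weight.reshape(weight.shape[0], -1)  # flatten conv kernel to a matrix
    u = rng.normal(size=w.shape[0])
    u /= np.linalg.norm(u)
    for _ in range(n_iters):
        v = w.T @ u
        v /= np.linalg.norm(v)
        u = w @ v
        u /= np.linalg.norm(u)
    sigma = float(u @ w @ v)  # estimate of the largest singular value
    return weight / sigma, sigma

rng = np.random.default_rng(1)
conv_weight = rng.normal(size=(16, 8, 3, 3))  # (out_ch, in_ch, kH, kW)
normalized, sigma = spectral_normalize(conv_weight)
```

After normalization, the flattened weight matrix has a largest singular value close to 1.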
Figure 2 - Real-ESRGAN’s high-order degradation chain (blur → downsample → noise → compression):
flowchart TD
A["High-resolution Image<br/>HR Image"] --> B["Blur Operation<br/>Blur Kernel"]
B --> C["Downsample"]
C --> D["Add Noise"]
D --> E["JPEG Compression"]
E --> F["Low-resolution Image<br/>LR Image"]
classDef img fill:#9C27B0,color:#fff
classDef op fill:#2196F3,color:#fff
class A,F img
class B,C,D,E op
SwinIR—Transformer’s Long-Range Dependencies
Transformers excel in natural language processing, but applying them directly to image restoration is computationally expensive. Standard self-attention has complexity $O(N^2)$, where $N$ is the number of image pixels. For a 256×256 image, $N = 65{,}536$, so a single attention map already has about $4.3 \times 10^9$ entries. SwinIR (Liang et al., 2021) addresses this with the shifted window attention mechanism of the Swin Transformer.
The core idea of the Swin Transformer is to divide the image into non-overlapping local windows and compute self-attention only within each window. With a fixed window size of $M \times M$, the cost becomes $O(M^2 N)$, which is linear in $N$. In addition, window positions are shifted in alternating layers so that information crosses window boundaries, and stacking layers progressively extends the modeling range toward the whole image.
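Window partitioning and the cyclic shift can be sketched directly on an array. The example below is a simplified illustration: the attention uses the features themselves as queries, keys, and values (no learned projections, no attention mask for shifted windows), and the window size of 8 in the cost comparison is an assumption chosen for the example.

```python
import numpy as np

def window_partition(x, m):
    # x: (H, W, C) with H and W divisible by m -> (num_windows, m*m, C)
    h, w, c = x.shape
    x = x.reshape(h // m, m, w // m, m, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, m * m, c)

def cyclic_shift(x, m):
    # Shift by half a window so the next layer's windows straddle old borders.
    return np.roll(x, shift=(-(m // 2), -(m // 2)), axis=(0, 1))

def window_self_attention(windows):
    # Simplified attention with Q = K = V = windows (no learned projections).
    scores = windows @ windows.transpose(0, 2, 1) / np.sqrt(windows.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ windows

def attention_pairs(h, w, m=None):
    # Number of query-key pairs: N^2 globally, N * M^2 with M x M windows.
    n = h * w
    return n * n if m is None else n * m * m

x = np.arange(8 * 8 * 2, dtype=np.float64).reshape(8, 8, 2)
windows = window_partition(x, 4)                           # 4 windows of 16 tokens
shifted_windows = window_partition(cyclic_shift(x, 4), 4)  # alternate layer
out = window_self_attention(windows)

global_cost = attention_pairs(256, 256)         # 4,294,967,296 pairs
windowed_cost = attention_pairs(256, 256, m=8)  # 4,194,304 pairs
```

For a 256×256 input, restricting attention to 8×8 windows cuts the number of query-key pairs by a factor of 1,024.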
SwinIR’s network architecture consists of three parts: shallow feature extraction uses 3×3 convolutions to extract initial features; deep feature extraction uses multiple Residual Swin Transformer Blocks (RSTB), each containing several Swin Transformer layers and convolutional layers; high-quality image reconstruction uses sub-pixel convolution and convolutional layers to generate the final image.
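The sub-pixel convolution used in the reconstruction stage ends with a "pixel shuffle" step that rearranges channels into spatial positions. A minimal NumPy sketch of that rearrangement follows, assuming a channels-first layout; the function name mirrors the common terminology but is defined here for illustration only.

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange (C * r^2, H, W) into (C, H * r, W * r)."""
    c_r2, h, w = x.shape
    c = c_r2 // (r * r)
    x = x.reshape(c, r, r, h, w)      # split channels into (c, i, j)
    x = x.transpose(0, 3, 1, 4, 2)    # reorder to (c, h, i, w, j)
    return x.reshape(c, h * r, w * r)

# Four channels at one pixel become one 2x2 spatial patch.
tiny = np.arange(4).reshape(4, 1, 1)
patch = pixel_shuffle(tiny, 2)

features = np.zeros((3 * 16, 5, 5))    # conv output for a 4x upscale
upscaled = pixel_shuffle(features, 4)  # shape (3, 20, 20)
```

The preceding convolution produces $r^2$ times as many channels as the output needs, so the upscaling itself involves no interpolation: every output pixel is a learned feature.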
The Transformer's advantage lies in long-range modeling. Convolutional networks have limited receptive fields and struggle to capture long-range dependencies, whereas SwinIR's shifted-window self-attention relates distant pixels through content-dependent weights, which helps recover texture details and structural information. At the time of publication, SwinIR reported state-of-the-art results on image super-resolution, denoising, and JPEG compression artifact reduction.
Diffusion Models—From Noise to Clear Images
Diffusion models drove major advances in image generation and restoration between 2020 and 2022. They are built from two processes: a forward process and a reverse process.
The forward process gradually adds Gaussian noise to clear images until the image becomes pure noise. Each step of noise addition satisfies:
$$ q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t} x_{t-1}, \beta_t I) $$
where $\beta_t$ is the noise schedule parameter controlling how much noise is added per step. After $T$ steps, $x_T$ approximately follows a standard normal distribution.
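Because each step is Gaussian, the forward process has a closed form: with $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s \le t} \alpha_s$, one can sample $x_t$ directly from $x_0$ as $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$. The sketch below assumes a linear schedule from $10^{-4}$ to $0.02$ over 1000 steps, a common choice that is not specified in the text above; the function names are hypothetical.

```python
import numpy as np

def linear_beta_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    # Assumed schedule: linearly increasing noise variance per step.
    return np.linspace(beta_start, beta_end, T)

def q_sample(x0, t, alpha_bar, rng):
    # Jump straight to step t (0-based index) using the closed form.
    noise = rng.normal(size=x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise
    return x_t, noise

betas = linear_beta_schedule()
alpha_bar = np.cumprod(1.0 - betas)  # cumulative signal retention

rng = np.random.default_rng(0)
x0 = np.ones((32, 32))               # stand-in for a clean image
x_mid, _ = q_sample(x0, 250, alpha_bar, rng)
x_last, _ = q_sample(x0, len(betas) - 1, alpha_bar, rng)
```

With this schedule, the final value of `alpha_bar` is close to zero, so almost none of the original image survives in `x_last`, which is consistent with $x_T$ being approximately standard normal.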
The reverse process trains neural networks to gradually recover clear images from noise. The network learns to predict the noise that should be removed at each step:
$$ p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t)) $$
In super-resolution, diffusion models input low-resolution images as conditional information to the denoising network. At each iteration, the network references both the low-resolution image’s guidance and the current noisy image state, progressively generating high-resolution details.
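A sketch of how the conditioning enters the sampling loop, in the style of SR3, where an upsampled copy of the low-resolution image is concatenated with the current noisy image as input to the denoiser. Several parts are assumptions for illustration: `dummy_eps_model` is a placeholder for a trained network, nearest-neighbor upsampling stands in for bicubic interpolation, and the reverse variance is fixed to $\beta_t$ instead of being learned as $\Sigma_\theta$.

```python
import numpy as np

def upsample_nearest(lr, scale):
    # Stand-in for bicubic upsampling of the low-resolution condition.
    return np.repeat(np.repeat(lr, scale, axis=0), scale, axis=1)

def dummy_eps_model(net_input, t):
    # Placeholder for a trained denoising network (hypothetical).
    # A real model would predict the noise present in net_input[0].
    return np.zeros_like(net_input[0])

def sr_sample(lr, scale, eps_model, betas, rng):
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    cond = upsample_nearest(lr, scale)
    x = rng.normal(size=cond.shape)        # start from x_T ~ N(0, I)
    for t in reversed(range(len(betas))):
        net_input = np.stack([x, cond])    # condition by channel concatenation
        eps = eps_model(net_input, t)
        # Mean of p(x_{t-1} | x_t) from the predicted noise.
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alphas[t])
        noise = rng.normal(size=x.shape) if t > 0 else 0.0  # no noise at last step
        x = mean + np.sqrt(betas[t]) * noise
    return x

rng = np.random.default_rng(0)
lr = rng.random((8, 8))
betas = np.linspace(1e-4, 0.02, 50)
sr = sr_sample(lr, 4, dummy_eps_model, betas, rng)  # shape (32, 32)
```

With the placeholder model the output is only noise; the point of the sketch is the loop structure, where the same condition is fed to the network at every step.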
The advantage of diffusion models is that their results are extremely realistic and contain plausible synthesized texture detail. The cost is slow inference: denoising requires many iterations, usually 50 to 1000 steps. Representative works include SR3 (Image Super-Resolution via Iterative Refinement) and LDM (Latent Diffusion Model). LDM runs the diffusion process in a lower-dimensional latent space, which dramatically improves inference speed.
Figure 3 - Forward and reverse processes of diffusion models (noise addition and iterative denoising):
flowchart TD
A["Clear Image<br/>x₀"] --> B["Add Noise<br/>x₁ → x₂ → ... → x_T"]
B --> C["Pure Noise<br/>x_T ~ N(0, I)"]
C --> D["Denoising Network<br/>Predict Noise"]
D --> E["Remove Noise<br/>x_{T-1} → ... → x₁ → x₀"]
E --> F["Recovered Clear Image<br/>x₀̂"]
classDef img fill:#4CAF50,color:#fff
classDef process fill:#2196F3,color:#fff
classDef noise fill:#FF9800,color:#fff
class A,F img
class B,E process
class C,D noise
Method Comparison and Selection Guide
Traditional Methods vs Deep Learning Methods
| Dimension | Traditional Methods | Deep Learning Methods |
|---|---|---|
| Compute resource needs | Low, can run on embedded devices | High, typically requires GPU |
| Inference speed | Fast, real-time processing | Slower, diffusion models especially slow |
| Generalization ability | Relies on degradation model assumptions, limited generalization | Data-driven, strong generalization |
| Texture details | Hard to generate realistic textures | Can synthesize natural textures |
| Interpretability | Strong, clear mathematical principles | Weak, black-box models |
| Training needs | No training required | Requires large data and compute |
| Suitable scenarios | Known degradation model, resource-constrained | Complex real-world scenarios, quality-priority |
Method-Specific Use Cases
| Method | Best For | Not Recommended |
|---|---|---|
| Bicubic interpolation | Real-time preview, low quality requirements | Large scaling (>4x) |
| Wiener filter | Deblurring with known PSF, astronomical images | Blind deblurring, unknown noise |
| Bilateral filter | Light denoising, edge preservation | Strong noise, texture regions |
| SRCNN | Fast super-resolution, mobile devices | Texture-rich natural images |
| SRGAN/ESRGAN | Photo enhancement, visual quality priority | Scenes requiring strict fidelity |
| Real-ESRGAN | Real-world low-quality image restoration | Artistic stylization |
| Diffusion models | Ultimate quality, detail synthesis | Real-time applications, resource-constrained |
Evaluation Metrics
Super-resolution evaluation metrics fall into two categories: pixel-level metrics measure reconstruction accuracy, while perceptual-level metrics measure visual quality.
PSNR (Peak Signal-to-Noise Ratio) is the most commonly used pixel-level metric. It is derived from the mean squared error (MSE) between the reconstructed and ground-truth images:
$$ \text{PSNR} = 10 \cdot \log_{10}\left(\frac{\text{MAX}_I^2}{\text{MSE}}\right) $$
where $\text{MAX}_I$ is the maximum pixel value (255 for 8-bit images). Higher PSNR is better, but it only focuses on pixel differences without considering human perception characteristics.
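The formula translates directly into code. A minimal sketch, with a hypothetical function name; inputs are cast to floating point first so that subtracting 8-bit arrays cannot wrap around.

```python
import numpy as np

def psnr(reference, reconstructed, max_value=255.0):
    ref = np.asarray(reference, dtype=np.float64)
    rec = np.asarray(reconstructed, dtype=np.float64)
    mse = np.mean((ref - rec) ** 2)
    if mse == 0.0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))

ground_truth = np.zeros((4, 4), dtype=np.uint8)
off_by_one = ground_truth + 1              # every pixel differs by 1, so MSE = 1
value = psnr(ground_truth, off_by_one)     # 10 * log10(255^2), about 48.13 dB
```

Published PSNR values for super-resolution are often computed on the luminance channel after cropping image borders, so numbers from different evaluation scripts are not always directly comparable.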
SSIM (Structural Similarity Index) compares two images along three dimensions: luminance, contrast, and structure. It ranges from -1 to 1, with 1 meaning the images are identical. SSIM agrees with human perception better than PSNR, but it still tends to favor overly smooth results.
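A simplified sketch of the SSIM formula, computed once over the whole image. This is an assumption made for brevity: the standard metric evaluates the same expression in small local windows (typically Gaussian-weighted) and averages the results, so values from this global version will differ from library implementations. The constants follow the usual choice $C_1 = (0.01 L)^2$ and $C_2 = (0.03 L)^2$.

```python
import numpy as np

def global_ssim(x, y, max_value=255.0):
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    c1 = (0.01 * max_value) ** 2  # stabilizes the luminance term
    c2 = (0.03 * max_value) ** 2  # stabilizes the contrast/structure term
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = np.mean((x - mu_x) * (y - mu_y))
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return float(numerator / denominator)

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 32))
same = global_ssim(image, image)            # identical images
inverted = global_ssim(image, 255 - image)  # negatively correlated structure
```

Inverting the image flips the sign of the covariance, which is how SSIM can drop below zero.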
LPIPS (Learned Perceptual Image Patch Similarity) is a perceptual-level metric that extracts features with a pre-trained deep network (such as VGG or AlexNet) and measures distance in feature space. A lower LPIPS means the output is perceptually closer to the reference, and the metric tracks human judgments of texture and detail better than PSNR or SSIM.
FID (Fréchet Inception Distance) is commonly used to evaluate generation quality, calculating the Fréchet distance between real image distribution and generated image distribution in Inception network feature space. Lower FID indicates the generated distribution is closer to the real distribution.
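The Fréchet distance between two Gaussians fitted to feature sets is $\lVert \mu_a - \mu_b \rVert^2 + \mathrm{Tr}\big(\Sigma_a + \Sigma_b - 2(\Sigma_a \Sigma_b)^{1/2}\big)$. The sketch below computes it with NumPy, using the eigenvalues of $\Sigma_a \Sigma_b$ to obtain the trace of the matrix square root. Random vectors stand in for Inception features here, which is an assumption for illustration; a real FID computation needs features from the Inception network.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    # feats_*: (num_samples, feature_dim)
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^(1/2)) equals the sum of square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative in theory.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = np.sum((mu_a - mu_b) ** 2)
    return float(mean_term + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)

rng = np.random.default_rng(0)
real_feats = rng.normal(size=(500, 8))   # stand-in for Inception features
shifted_feats = real_feats + 0.5         # same covariance, shifted mean

fd_same = frechet_distance(real_feats, real_feats)
fd_shifted = frechet_distance(real_feats, shifted_feats)
```

When only the mean differs, the covariance terms cancel and the distance reduces to the squared mean shift, here $8 \times 0.5^2 = 2$.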
In practical applications, metrics should be selected based on the scenario. If pursuing pixel-level fidelity, prioritize PSNR and SSIM. If pursuing visual quality, prioritize LPIPS and FID. Diffusion models typically perform best on perceptual metrics but may have lower PSNR than other methods.
Summary
From SRGAN to diffusion models, super-resolution has progressed through perceptual and adversarial training, realistic degradation modeling, Transformer-based long-range modeling, and finally iterative denoising in place of single-step reconstruction. Each method has its own strengths and suitable scenarios:
- ESRGAN provides a good balance between quality and speed
- Real-ESRGAN specializes in real-world low-quality image restoration
- SwinIR achieves powerful global modeling with Transformers
- Diffusion models achieve ultimate visual quality but at high inference cost
When selecting methods, you need to balance quality, speed, computational resources, and application scenarios. For real-time applications, you can choose SRCNN or ESRGAN. For photo enhancement, Real-ESRGAN is a good choice. For scenarios requiring ultimate quality, diffusion models are currently the best option.