Deep Learning Super-Resolution — SRCNN and SRGAN

From Mathematical Models to Data-Driven Learning

In super-resolution, traditional methods rely on carefully designed mathematical models such as interpolation algorithms, sparse representation, and prior constraints. Deep learning brought a paradigm shift: the mapping from degraded images to clear ones is learned directly from large sets of paired low-resolution (LR) and high-resolution (HR) images.

The core idea is simple: train a neural network $F_{\theta}$ that can map low-resolution images to high-resolution images:

$$ \hat{I}_{HR} = F_{\theta}(I_{LR}) $$

The training process optimizes network parameters $\theta$ by minimizing the difference between predictions and real high-resolution images:

$$ \theta^* = \arg\min_{\theta} \sum_{i} \mathcal{L}(F_{\theta}(I_{LR}^{(i)}), I_{HR}^{(i)}) $$
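As a minimal sketch of this objective, assume a toy model with a single parameter, $F_{\theta}(x) = \theta x$, and synthetic pairs in which each HR target is roughly twice its LR input. Nothing here is a real super-resolution network; `lr_batch`, `hr_batch`, `model`, and the step size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: each "HR" target is about 2x its "LR" input.
lr_batch = rng.random((8, 4, 4))
hr_batch = 2.0 * lr_batch + 0.01 * rng.standard_normal((8, 4, 4))


def model(theta, x):
    """Toy stand-in for F_theta: a single learnable gain."""
    return theta * x


def total_loss(theta):
    """Sum over all pairs of the squared error between prediction and target."""
    return float(np.sum((model(theta, lr_batch) - hr_batch) ** 2))


theta = 0.0
step = 0.01
initial_loss = total_loss(theta)
for _ in range(200):
    # Gradient of the summed squared error with respect to theta.
    grad = float(np.sum(2.0 * (model(theta, lr_batch) - hr_batch) * lr_batch))
    theta -= step * grad
final_loss = total_loss(theta)
```

After the loop, `theta` should sit close to 2, the gain that generated the data. A real network has millions of parameters and uses automatic differentiation, but the loop has the same shape.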

No need to manually design filters or priors—let the network learn how to recover details from data itself.

SRCNN: Three-Layer Foundation

SRCNN (Super-Resolution Convolutional Neural Network), published by Dong et al. at ECCV 2014, is the foundational work of deep learning super-resolution. The paper’s core insight: the three traditional steps of super-resolution (patch extraction, nonlinear mapping, reconstruction) can be directly implemented with a three-layer convolutional network.

The network structure is clear and straightforward:

Figure 1 - SRCNN network architecture (three convolutional layers directly correspond to the three traditional steps of super-resolution):

```mermaid
flowchart TD
    INPUT["Input LR Image<br/>Upscaled by interpolation<br/>to target size"] --> FE["Feature Extraction<br/>9x9 Conv + ReLU<br/>Extract edges, textures"]
    FE --> NLM["Non-Linear Mapping<br/>1x1 Conv + ReLU<br/>Map LR patch features<br/>to HR patch features"]
    NLM --> REC["Reconstruction<br/>5x5 Conv<br/>Aggregate features<br/>to generate HR image"]
    REC --> OUTPUT["Output HR Image"]

    classDef io fill:#2196F3,color:#fff
    classDef layer fill:#9C27B0,color:#fff
    class INPUT,OUTPUT io
    class FE,NLM,REC layer
```

Feature Extraction Layer

The first layer applies $9 \times 9$ convolution kernels (64 filters in the paper’s basic configuration) followed by ReLU activation. Convolution slides a window across the image, extracting local features. The large kernel size ($9 \times 9$) captures a wider spatial context around each pixel. ReLU introduces nonlinearity, enabling the network to learn more complex features.

This layer extracts low-level features like edges and textures from the interpolated low-resolution image—these are the foundation for detail reconstruction.

Non-Linear Mapping Layer

The second layer is a $1 \times 1$ convolution with ReLU. A $1 \times 1$ convolution leaves the spatial dimensions unchanged but can change the channel count: at each pixel position it takes a linear combination of the incoming feature channels, and the ReLU that follows makes the overall mapping nonlinear. In the paper’s basic configuration this maps each 64-dimensional feature vector to a 32-dimensional one.

This is where the network translates between representations: each feature vector describing a low-resolution patch is mapped to a vector that conceptually represents the corresponding high-resolution patch, which the final layer then turns back into pixels.
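To make the per-pixel view of a $1 \times 1$ convolution concrete, here is a small NumPy sketch. The channel counts (64 in, 32 out) follow SRCNN’s basic configuration; the random weights and the 6x6 feature map are placeholders, not trained values.

```python
import numpy as np

rng = np.random.default_rng(0)

features = rng.standard_normal((64, 6, 6))  # (channels, height, width)
weights = rng.standard_normal((32, 64))     # 1x1 kernel: (out_channels, in_channels)
bias = rng.standard_normal(32)

# A 1x1 convolution applies the same linear map independently at every pixel.
mapped = np.einsum("oc,chw->ohw", weights, features) + bias[:, None, None]
mapped = np.maximum(mapped, 0.0)  # ReLU

# The same result for one pixel, written as a plain matrix-vector product.
pixel = np.maximum(weights @ features[:, 2, 3] + bias, 0.0)
```

The spatial size stays 6x6 while the channel count goes from 64 to 32, and `mapped[:, 2, 3]` matches `pixel`.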

Reconstruction Layer

The third layer is $5 \times 5$ convolution without activation function. This layer aggregates high-dimensional features back into image space, generating the final high-resolution image.
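The three layers can be strung together in a short NumPy sketch. This is an illustration under assumptions, not the authors’ implementation: it uses zero padding so the output keeps the input size, random untrained weights, and a naive loop-based `conv2d` helper written for readability rather than speed.

```python
import numpy as np


def conv2d(x, weight, bias):
    """Zero-padded, stride-1 convolution (cross-correlation, as in DL libraries).

    x: (C_in, H, W), weight: (C_out, C_in, k, k), bias: (C_out,)
    """
    c_out, _, k, _ = weight.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    h, w = x.shape[1:]
    out = np.empty((c_out, h, w))
    for i in range(h):
        for j in range(w):
            patch = xp[:, i:i + k, j:j + k]  # (C_in, k, k)
            out[:, i, j] = np.tensordot(weight, patch, axes=3) + bias
    return out


def srcnn_forward(x, params):
    """9x9 conv + ReLU, 1x1 conv + ReLU, 5x5 conv with no activation."""
    (w1, b1), (w2, b2), (w3, b3) = params
    h = np.maximum(conv2d(x, w1, b1), 0.0)
    h = np.maximum(conv2d(h, w2, b2), 0.0)
    return conv2d(h, w3, b3)


rng = np.random.default_rng(0)


def init(c_out, c_in, k):
    """Small random weights and zero bias (placeholder, untrained)."""
    return 0.01 * rng.standard_normal((c_out, c_in, k, k)), np.zeros(c_out)


params = [init(64, 1, 9), init(32, 64, 1), init(1, 32, 5)]
upscaled = rng.random((1, 12, 12))  # stands in for an interpolated LR image
output = srcnn_forward(upscaled, params)
```

The output has the same shape as the interpolated input, which reflects the design: SRCNN does not change resolution itself, it refines an image that has already been upscaled.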

Loss Function

SRCNN uses pixel-level mean squared error (MSE) as the loss function:

$$ \mathcal{L}_{MSE} = \frac{1}{N} \sum_{i=1}^{N} \| F_{\theta}(I_{LR}^{(i)}) - I_{HR}^{(i)} \|_2^2 $$

MSE measures the difference between predicted and real images at each pixel position—optimizing MSE means learning to generate pixel values as close as possible to the real image.
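The formula translates almost directly into NumPy. The sketch below follows the equation as written (squared L2 distance per image, averaged over the batch); many implementations additionally divide by the number of pixels, which only rescales the loss. The function name `mse_loss` is an illustrative choice.

```python
import numpy as np


def mse_loss(predictions, targets):
    """Mean over the batch of the squared L2 distance between image pairs."""
    diff = predictions - targets
    per_image = np.sum(diff.reshape(len(diff), -1) ** 2, axis=1)
    return float(np.mean(per_image))


# Two 2x2 images, each off by 1 at all four pixels.
predictions = np.zeros((2, 2, 2))
targets = np.ones((2, 2, 2))
loss = mse_loss(predictions, targets)
```

Each image contributes a squared distance of 4 (four pixels, each off by 1), so the batch mean is 4.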

Limitations and Historical Significance

SRCNN’s network is shallow (only 3 layers) with a limited receptive field, making it difficult to capture long-range dependencies. It also requires the input to be upsampled by bicubic interpolation first, so every convolution runs at the target resolution, which is computationally inefficient. And because a trained model targets a single upscaling factor, changing the factor means training another model.
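How limited the receptive field is can be worked out directly. For a stack of stride-1 convolutions, each layer with kernel size $k$ widens the field by $k - 1$ pixels. The helper below is a small sketch of that rule, applied to the 9-1-5 kernel sizes described above.

```python
def receptive_field(kernel_sizes):
    """Receptive field of stacked stride-1 convolutions, in pixels per side."""
    field = 1
    for k in kernel_sizes:
        field += k - 1
    return field


srcnn_field = receptive_field([9, 1, 5])
```

For SRCNN this gives 13, so each output pixel depends on a 13x13 window of the interpolated input and nothing beyond it.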

But SRCNN’s historical significance cannot be overlooked: it proved that deep learning methods are effective for super-resolution tasks, paving the way for subsequent research. Before SRCNN, super-resolution mainly relied on traditional methods like sparse representation and prior constraints.

SRGAN: New Dimensions of Perceptual Quality

SRGAN (Super-Resolution Generative Adversarial Network), published by Ledig et al. at CVPR 2017, introduced Generative Adversarial Networks (GANs) to super-resolution tasks for the first time. Its two core innovations, perceptual loss and adversarial loss, shifted the emphasis of super-resolution from pixel-level fidelity toward perceptual quality.

Figure 2 - SRGAN training pipeline (adversarial game between generator and discriminator, combined with VGG perceptual loss):

```mermaid
flowchart TD
    LR["Low-Resolution<br/>Image"] --> G["Generator G<br/>ResNet + Sub-pixel Conv"]
    G --> SR["Super-Resolution<br/>Image"]
    SR --> D["Discriminator D<br/>Judge Realism"]
    SR --> VGG["VGG Feature Extraction<br/>Compute Perceptual Loss"]
    HR["Real High-Resolution<br/>Image"] --> D
    HR --> VGG

    D --> LOSS_D["Adversarial Loss<br/>L_adv"]

    VGG --> LOSS_P["Perceptual Loss<br/>L_perceptual"]

    classDef io fill:#2196F3,color:#fff
    classDef net fill:#9C27B0,color:#fff
    classDef loss fill:#FF9800,color:#fff
    class LR,HR io
    class G,D,VGG net
    class LOSS_D,LOSS_P loss
```

Perceptual Loss

SRGAN’s first innovation is perceptual loss. Traditional methods (including SRCNN) optimize MSE in pixel space, but this has a problem: MSE tends to generate “average” pixel values, resulting in blurry outputs.

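Why? Consider an edge region where the low-resolution input is consistent with several high-resolution answers, for example because the exact edge position is ambiguous. Suppose a given pixel is equally likely to be 0 (black) or 255 (white). A sharp guess of 0 or 255 is either exactly right or wrong by 255, for an expected squared error of $255^2 / 2 = 32512.5$. The intermediate guess 128 is always wrong, but only by about half as much: $|128-0|^2 = 16384$ and $|128-255|^2 = 16129$, for an expected squared error of $16256.5$. MSE therefore rewards the gray average over either sharp alternative, and averaging over plausible edges is exactly what blur is.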
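The pull toward intermediate values can be checked with a few lines of arithmetic. The sketch assumes a single pixel whose true value is equally likely to be 0 or 255, which is a simplification of an ambiguous edge; the candidate predictions are arbitrary.

```python
def expected_squared_error(prediction, outcomes):
    """Average squared error over equally likely true pixel values."""
    return sum((prediction - value) ** 2 for value in outcomes) / len(outcomes)


outcomes = [0, 255]                      # the pixel is black or white
candidates = [0, 64, 127.5, 192, 255]    # possible network predictions
errors = {p: expected_squared_error(p, outcomes) for p in candidates}
best = min(errors, key=errors.get)
```

Among these candidates the midpoint 127.5 has the lowest expected error (16256.25, versus 32512.5 for either sharp value), so a network trained on MSE is pushed toward gray wherever the data are ambiguous.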

Perceptual loss takes a different approach: it measures distance not in pixel space but in the feature space of a VGG network. VGG is a classification network pre-trained on ImageNet, and its intermediate-layer features encode information about image content such as texture and shape.

$$ \mathcal{L}_{perceptual} = \| \phi(I_{HR}) - \phi(\hat{I}_{HR}) \|_2^2 $$

Where $\phi$ is the feature extractor of a pre-trained VGG network’s intermediate layer.

Optimizing perceptual loss means: make generated images close to real images in feature space, not in pixel space. This produces sharper, more realistic textures—because VGG features are sensitive to texture details.
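The structure of the loss is easy to sketch even without a VGG network. In the example below, `gradient_features` is a hypothetical stand-in for $\phi$ that returns horizontal and vertical finite differences; it is not what SRGAN uses, where $\phi$ would be activations from a pre-trained VGG model loaded through a deep learning framework. Only the shape of the computation carries over: extract features from both images, then take the squared distance between them.

```python
import numpy as np


def gradient_features(image):
    """Hypothetical stand-in for phi: horizontal and vertical differences."""
    dx = image[:, 1:] - image[:, :-1]
    dy = image[1:, :] - image[:-1, :]
    return dx, dy


def perceptual_loss(phi, hr, sr):
    """Squared L2 distance between phi(hr) and phi(sr), summed over feature maps."""
    return float(sum(np.sum((f_hr - f_sr) ** 2)
                     for f_hr, f_sr in zip(phi(hr), phi(sr))))


hr = np.zeros((4, 6))
hr[:, 3:] = 1.0                                     # sharp vertical edge
blurry = np.tile(np.linspace(0.0, 1.0, 6), (4, 1))  # same edge smeared into a ramp

loss_identical = perceptual_loss(gradient_features, hr, hr)
loss_blurry = perceptual_loss(gradient_features, hr, blurry)
```

With this stand-in, the blurry ramp is penalized because its gradients are spread thin where the real image has one strong edge response. A learned feature extractor such as VGG plays the same role for far richer texture patterns.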

Adversarial Loss

SRGAN’s second innovation is adversarial loss. The core idea of GAN is the adversarial game between Generator and Discriminator:

  • Generator G tries to generate realistic super-resolution images to fool the discriminator
  • Discriminator D tries to distinguish between real high-resolution images and fake images generated by the generator

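The generator’s objective combines two parts, the perceptual loss and the adversarial loss, with the adversarial term weighted by $10^{-3}$ as in the SRGAN paper:

$$ \mathcal{L}_G = \mathcal{L}_{perceptual} + 10^{-3} \mathcal{L}_{adv} $$

(The SRGAN paper itself calls the VGG feature term the content loss and reserves the name perceptual loss for this weighted sum.)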

Where $\mathcal{L}_{adv}$ is adversarial loss:

$$ \mathcal{L}_{adv} = \sum_{n=1}^{N} -\log D(G(I_{LR}^{(n)})) $$

The generator wants $D(G(I_{LR}))$ as large as possible (discriminator thinks generated images are real), so maximizes $\log D(G(I_{LR}))$, i.e., minimizes $-\log D(G(I_{LR}))$.

The discriminator’s objective function is:

$$ \mathcal{L}_D = -\sum_{n=1}^{N} [\log D(I_{HR}^{(n)}) + \log (1 - D(G(I_{LR}^{(n)})))] $$

The discriminator wants $D(I_{HR})$ as large as possible (correctly identify real images), and $D(G(I_{LR}))$ as small as possible (correctly identify generated images).

Through this adversarial training, the generator learns to generate textures that better match the statistical distribution of real images—because if textures are too fake, the discriminator will easily identify them.
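The two objectives can be evaluated by hand for a few discriminator outputs. The sketch below takes the probabilities $D(\cdot)$ as given numbers rather than running a network; the example values are made up to show how each loss moves.

```python
import math


def generator_adversarial_loss(d_fake):
    """Sum of -log D(G(I_LR)) over generated images."""
    return sum(-math.log(p) for p in d_fake)


def discriminator_loss(d_real, d_fake):
    """Negative sum of log D(I_HR) + log(1 - D(G(I_LR)))."""
    return -sum(math.log(r) + math.log(1.0 - f) for r, f in zip(d_real, d_fake))


# Discriminator outputs (probability "real") for two real and two generated images.
d_real = [0.9, 0.8]
fooled = [0.9, 0.7]   # discriminator believes the generated images
caught = [0.1, 0.3]   # discriminator sees through them

g_loss_fooled = generator_adversarial_loss(fooled)
g_loss_caught = generator_adversarial_loss(caught)
d_loss_fooled = discriminator_loss(d_real, fooled)
d_loss_caught = discriminator_loss(d_real, caught)
```

The generator’s loss is lower when the discriminator is fooled, and the discriminator’s loss is lower when it catches the fakes, which is the tug-of-war that drives training.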

Generator Architecture

SRGAN’s generator uses a deep Residual Network (ResNet) structure with 16 residual blocks. Each residual block contains two $3 \times 3$ convolutional layers, with skip connections to mitigate gradient vanishing.

Upsampling uses sub-pixel convolution. Sub-pixel convolution was proposed by Shi et al. in 2016 in ESPCN (Efficient Sub-Pixel Convolutional Neural Network), and SRGAN adopted this technique.

The core idea of sub-pixel convolution: first use convolution to generate $r^2$ channel feature maps (where $r$ is the upscaling factor), then rearrange pixels through periodic shuffling, converting channel dimensions to spatial dimensions:

$$ I_{HR} = PS(W * I_{LR} + b) $$

Where $PS$ is the periodic shuffling operation. For example, if the upscaling factor is 2, first generate 4 channel feature maps, then rearrange them into a single-channel image with 2x width and height.
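The periodic shuffling step is a pure rearrangement and can be written with a reshape and a transpose. The sketch below uses the channel ordering found in common deep learning libraries (channel $i \cdot r + j$ lands at sub-pixel offset $(i, j)$); the function name `pixel_shuffle` and the constant-valued test input are illustrative.

```python
import numpy as np


def pixel_shuffle(features, r):
    """Rearrange (C*r*r, H, W) feature maps into a (C, H*r, W*r) image."""
    c_r2, h, w = features.shape
    c = c_r2 // (r * r)
    x = features.reshape(c, r, r, h, w)   # split channels into (c, i, j)
    x = x.transpose(0, 3, 1, 4, 2)        # reorder to (c, h, i, w, j)
    return x.reshape(c, h * r, w * r)     # interleave i with h and j with w


# Four 2x2 channels, channel k filled with the value k, upscaled by r = 2.
features = np.stack([np.full((2, 2), k) for k in range(4)])
image = pixel_shuffle(features, 2)
```

The result is one 4x4 channel in which every 2x2 cell reads `[[0, 1], [2, 3]]`: each of the four input channels supplies one sub-pixel position of the output.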

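Sub-pixel convolution is more efficient than the interpolate-first pipeline used by SRCNN. When the image is upscaled by bicubic interpolation before entering the network, every convolution has to process $r^2$ times as many pixels, even though interpolation adds no new information. Sub-pixel convolution keeps all convolutions in low-resolution space and learns the upsampling filters directly, expanding to full resolution only at the end.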
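The size of the saving can be estimated by counting multiply-accumulate operations for one convolution layer at each resolution. The layer shape below (3x3 kernel, 64 channels in and out) and the 64x64 input are arbitrary illustrative choices, not figures from either paper.

```python
def conv_macs(height, width, kernel, c_in, c_out):
    """Multiply-accumulates for one stride-1 convolution over the whole image."""
    return height * width * kernel * kernel * c_in * c_out


r = 4                     # upscaling factor
lr_h, lr_w = 64, 64       # low-resolution input size

# Interpolate first: the layer runs on the upscaled image.
macs_high_res = conv_macs(lr_h * r, lr_w * r, 3, 64, 64)
# Sub-pixel approach: the same layer runs on the low-resolution image.
macs_low_res = conv_macs(lr_h, lr_w, 3, 64, 64)

ratio = macs_high_res // macs_low_res
```

The ratio equals $r^2$, so at 4x upscaling each such layer costs 16 times more when it runs after interpolation.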

Figure 3 - Pixel Loss vs Perceptual Loss (why GAN generates sharper textures):

```mermaid
flowchart TD
    PIXEL["Pixel Loss (MSE)<br/>Optimize pixel difference"] --> BLUR["Tend to generate<br/>blurry average results"]
    PIXEL --> REASON_P["Reason: Edge regions<br/>predict intermediate values<br/>minimize total error"]

    PERCEPTUAL["Perceptual Loss (VGG)<br/>Optimize feature space difference"] --> SHARP["Tend to generate<br/>sharp texture details"]
    PERCEPTUAL --> REASON_V["Reason: VGG features<br/>sensitive to texture<br/>penalize blurry results"]

    GAN["Adversarial Loss (GAN)<br/>Generator-Discriminator game"] --> REAL["Generate textures matching<br/>real image statistics"]
    GAN --> REASON_G["Reason: Discriminator<br/>can identify unnatural<br/>texture patterns"]

    classDef loss fill:#2196F3,color:#fff
    classDef result fill:#FF9800,color:#fff
    classDef reason fill:#9C27B0,color:#fff
    class PIXEL,PERCEPTUAL,GAN loss
    class BLUR,SHARP,REAL result
    class REASON_P,REASON_V,REASON_G reason
```

Tradeoff Between Quality and Metrics

An interesting phenomenon with SRGAN: generated images have significantly better visual quality (subjective perception) than SRCNN, but PSNR metrics might be slightly lower.

PSNR (Peak Signal-to-Noise Ratio) is calculated based on MSE, measuring pixel-level fidelity. MSE optimization tends to generate blurry average results, while GAN’s adversarial training introduces some “reasonable” textures—these textures might not be pixel-level exact matches, but look more realistic.
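Because PSNR is a direct function of MSE, it is short to compute. The sketch below uses the standard definition $10 \log_{10}(\text{peak}^2 / \text{MSE})$ with per-pixel MSE and assumes 8-bit images with a peak value of 255; the function name `psnr` is an illustrative choice.

```python
import numpy as np


def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB; higher means closer pixel values."""
    diff = reference.astype(np.float64) - estimate.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(peak ** 2 / mse))


reference = np.zeros((4, 4))
off_by_one = psnr(reference, reference + 1.0)      # every pixel wrong by 1
fully_wrong = psnr(reference, reference + 255.0)   # every pixel wrong by 255
```

An image that is off by one gray level everywhere scores about 48 dB, while one that is wrong by the full range scores 0 dB. Nothing in the formula looks at texture or structure, which is why a sharp but slightly shifted texture can score worse than a blur.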

This reveals the core contradiction in image quality evaluation: pixel-level fidelity (PSNR) and perceptual quality don’t always align. SRGAN chose to optimize perceptual quality, sacrificing some pixel-level metrics.

But SRGAN is not the endpoint. Subsequent research (like ESRGAN, Real-ESRGAN) further optimized perceptual loss, adversarial loss, and network architectures, continuously improving quality.

From SRCNN to SRGAN: Evolution Logic

From SRCNN to SRGAN, deep learning super-resolution underwent two key evolutions:

  1. Network architecture evolution: from 3-layer shallow networks to deep ResNet, from fixed interpolation upsampling to efficient sub-pixel convolution upsampling
  2. Loss function evolution: from pixel-level MSE to feature space perceptual loss, from single loss function to adversarial training game mechanism

Both evolutions revolve around a core goal: make networks generate sharper, more realistic textures.

SRCNN proved the feasibility of deep learning in super-resolution, but was limited by network depth and loss function—generated images were still blurry. SRGAN broke through this limit through perceptual loss and adversarial loss, generating images with higher visual quality.

Summary

The core idea of deep learning super-resolution is data-driven: learn mappings from paired low-resolution and high-resolution images, rather than relying on manually designed mathematical models.

SRCNN proved the feasibility of deep learning with a three-layer convolutional network, a foundational work in super-resolution. SRGAN elevated super-resolution from pixel-level fidelity to a new dimension of perceptual quality through perceptual loss and adversarial loss.

The key to understanding these methods: why is perceptual loss better than pixel loss? Why does adversarial training generate more realistic textures? The answer is that the human visual system is sensitive to texture and detail, not just to individual pixel values. Optimizing objective functions that are aligned with human perception produces images that look better.

References

  • Dong, C., Loy, C. C., He, K., & Tang, X. (2014). Learning a deep convolutional network for image super-resolution. In European conference on computer vision (pp. 184-199). Springer.
  • Shi, W., Caballero, J., Huszár, F., Totz, J., Aitken, A. P., Bishop, R., … & Vedaldi, A. (2016). Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1874-1883).
  • Ledig, C., Theis, L., Huszár, F., Caballero, J., Cunningham, A., Acosta, A., … & Shi, W. (2017). Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4681-4690).