Adaptive Filtering: From LMS to NLMS
The LMS Algorithm
The core problem in adaptive filtering is: given a reference signal x(n) and a desired signal d(n), find a filter coefficient vector w such that the output y(n) approximates d(n).
The Least Mean Square (LMS) algorithm is the most classic solution to this problem, proposed by Widrow and Hoff in 1960. The core idea is to take one step down the steepest gradient of the error surface at each iteration — stochastic gradient descent.
Mathematical Model
For a filter of length N:
- Weight vector:
w(n) = [w_0(n), w_1(n), ..., w_{N-1}(n)]^T - Reference signal vector:
x(n) = [x(n), x(n-1), ..., x(n-N+1)]^T
The filter output is the inner product of the two:
$$ y(n) = w^T(n) x(n) = \sum_{k=0}^{N-1} w_k(n) x(n-k) $$
The error signal:
$$ e(n) = d(n) - y(n) $$
The cost function is the mean square error:
$$ J(n) = E[e^2(n)] $$
Taking the gradient with respect to w yields the weight update rule:
$$ \nabla J(n) = -2 e(n) x(n) $$
$$ w(n+1) = w(n) - \mu \nabla J(n) = w(n) + \mu e(n) x(n) $$
where \mu is the step size that controls convergence speed.
Adaptive Filter Structure
The diagram below shows the closed-loop workflow of an adaptive filter:
flowchart TD
REF["Reference Signal x(n)"] --> W["Adaptive Filter W(z)<br/>Weights w(n)"]
W -->|"Output y(n)"| SUM(("⊕"))
DES["Desired Signal d(n)<br/>(Error Microphone)"] --> SUM
SUM -->|"Error e(n) = d(n) - y(n)"| UPDATE["Weight Update<br/>w(n+1) = w(n) + μ·e(n)·x(n)"]
REF --> UPDATE
UPDATE --> W
classDef signal fill:#2196F3,color:#fff
classDef proc fill:#9C27B0,color:#fff
classDef err fill:#f44336,color:#fff
class REF,DES signal
class W,UPDATE,SUM procThe reference signal x(n) passes through the adaptive filter to produce output y(n), which is compared with the desired signal d(n) to compute the error e(n). The error is then fed back to guide weight updates. This closed loop runs continuously—the error decreases over time until the filter converges to the optimal solution.
Formula Breakdown
The LMS weight update rule looks like a single line, but each term has a clear physical meaning:
$$ w(n+1) = w(n) + \mu \cdot e(n) \cdot x(n) $$
| Symbol | Meaning | Intuition |
|---|---|---|
w(n) | Filter coefficients (N taps) at time n | The filter’s “memory”—how it combines past input samples |
μ | Step size | How large a step to take each iteration—too large → overshoot and diverge; too small → painfully slow convergence |
e(n) | Error signal d(n) - y(n) | A gauge telling the algorithm “how wrong it is”—larger errors drive bigger corrections |
x(n) | Reference signal vector | Tells the algorithm “which direction to adjust”—weights are corrected along x |
An Intuitive Analogy: Blindfolded Hill Descent
Imagine you’re blindfolded on a hillside, trying to reach the lowest point in the valley:
e(n)is your perception of the slope—steeper slope means larger error, farther from the bottomμis your stride length—too large and you might leap past the valley to the opposite slope (divergence); too small and you’ll take forever (slow convergence)x(n)is the terrain beneath your feet—telling you which direction to step- Each update: feel the slope → take a step in the right direction → feel again → step again
Numerical Example: With μ = 0.001, e = 0.5, x = 0.8, the update amount = 0.001 × 0.5 × 0.8 = 0.0004. If e increases to 5.0, the update = 0.001 × 5.0 × 0.8 = 0.004—larger errors drive larger corrections.
Why Is It Called “Stochastic” Gradient Descent?
Strictly speaking, the LMS cost function is the mean square error J(n) = E[e²(n)], and the true gradient is ∇J = -2 · E[e(n) · x(n)]. But LMS skips the expectation—it directly uses the instantaneous value -2 · e(n) · x(n) as a gradient estimate.
This means using one sample’s error in place of the expectation over all samples—hence “stochastic.” While a single estimate is noisy, averaging over many iterations steers the weights in the same direction as true gradient descent. This “instantaneous-instead-of-expected” tradeoff is what makes LMS computationally cheap yet effective.
Convergence Condition
The necessary and sufficient condition for LMS convergence is:
$$ 0 < \mu < \frac{2}{N \cdot P_x} $$
where P_x = E[x^2(n)] is the input signal power. Larger step sizes converge faster, but exceeding the upper bound causes divergence.
Convergence Condition Breakdown
The inequality 0 < μ < 2 / (N · P_x) determines whether LMS works correctly:
N: Filter length. More taps means each tap shares less “energy,” lowering the μ upper bound. A 64-tap filter has half the μ ceiling of a 32-tap filter.P_x: Input signal power. Stronger signals require smaller μ—otherwise gradient noise spirals out of control.- Intuition: μ too large → weights overshoot the optimum → oscillation or divergence; μ too small → glacial convergence.
Numerical Example: With N = 64, P_x = 1.0 (unit power signal):
| |
In practice, take 1/10 to 1/2 of the theoretical upper bound—roughly 0.003 to 0.015—for a safe margin.
Convergence Process
LMS convergence follows a “coarse-to-fine” pattern:
- Initial phase: Weights start at zero (or random values), error is large, and each update step is aggressive
- Learning phase: Weights approach the optimal values, error drops rapidly
- Steady-state phase: Weights jitter slightly around the optimum (“misadjustment noise”), error no longer decreases noticeably
The diagram below illustrates this iterative loop:
flowchart TD
A["Initialize<br/>w(0) = 0"] --> B["Process Sample<br/>y = w^T · x"]
B --> C["Compute Error<br/>e = d - y"]
C --> D["Update Weights<br/>w += μ·e·x"]
D --> E{"Converged?"}
E -->|"No<br/>Error still large"| B
E -->|"Yes<br/>Error small enough"| F["Steady State<br/>Fine-tuning"]
classDef init fill:#FF9800,color:#fff
classDef loop fill:#2196F3,color:#fff
classDef done fill:#4CAF50,color:#fff
class A init
class B,C,D loop
class F doneIf you plot the error over time, you’ll see a curve that drops steeply at first, then gradually flattens out—like a playground slide, steep at the top and gentle at the bottom.
The animation below draws this “slide” curve, clearly showing the transition through three phases:
C Implementation
A circular buffer manages the reference signal delay line, avoiding data movement at each iteration:
| |
lms_process first writes the reference sample into the circular buffer, computes the filter output, then updates each weight with w += mu * e * x. The ptr cycles forward, effectively creating a sliding window over the input.
The NLMS Algorithm
LMS has a practical problem: the step size \mu is highly sensitive to input signal amplitude. Large signals amplify gradient noise and can cause divergence; small signals make convergence painfully slow. Every new deployment scenario requires retuning \mu.
Normalized LMS (NLMS) solves this — it divides the step size by the instantaneous power of the input signal, making the update self-tuning.
Weight Update
The NLMS weight update formula:
$$ w(n+1) = w(n) + \frac{\mu}{x^T(n) x(n) + \varepsilon} e(n) x(n) $$
The denominator x^T(n) x(n) is the energy of the reference signal vector (sum of squared samples), and \varepsilon is a small positive constant preventing division by zero.
NLMS Formula Breakdown
The NLMS update can be understood as LMS with an “adaptive step size” mechanism:
$$ w(n+1) = w(n) + \frac{\mu}{x^T(n) x(n) + \varepsilon} \cdot e(n) \cdot x(n) $$
| Component | Meaning |
|---|---|
Denominator x^T(n) x(n) | Total energy of the current input signal (sum of squared samples). Strong signal → larger value → step shrinks; weak signal → smaller value → step grows |
ε (epsilon) | Protection against division by zero. Input may go silent (e.g., pause in audio), default 1e-6 |
μ / (x^T x + ε) | Effective step size, self-tuning per signal energy |
Numerical Example: With μ = 0.5, input x = [0.1, -0.05, 0.2, ...], energy x^T x = 0.1² + (-0.05)² + 0.2² + ... = 0.0525, then effective step = 0.5 / (0.0525 + 1e-6) ≈ 9.52. If the signal strengthens to x = [0.5, -0.3, 0.8, ...], energy ≈ 0.98, effective step ≈ 0.5 / 0.98 ≈ 0.51—strong signal automatically reduces the step size.
Comparing LMS and NLMS updates:
| Algorithm | Update |
|---|---|
| LMS | \mu \cdot e \cdot x |
| NLMS | \frac{\mu}{x^T x} \cdot e \cdot x |
NLMS effectively adjusts the equivalent step size at each iteration based on signal energy. Strong signals automatically reduce the step size; weak signals amplify it, keeping gradient noise stable.
Step Size Range
NLMS step size \mu guarantees convergence in the range (0, 2). Compared to LMS which requires tedious tuning per signal power, NLMS is far more practical in real systems.
LMS vs NLMS: Which One to Use?
| Scenario | Recommended | Reason |
|---|---|---|
| Stable signal amplitude (e.g., sine wave tracking, power line noise cancellation) | LMS | Lower computational cost—no division needed |
| Frequently changing amplitude (e.g., speech, audio ANC) | NLMS | Self-tuning step size, no need for retuning |
| Low-power embedded platform (no FPU) | LMS or fixed-point NLMS | LMS avoids division and energy accumulation; NLMS fixed-point can precompute energy |
| Quick prototyping, robustness preferred | NLMS | μ = 0.5 usually works out of the box |
Rule of thumb: Try NLMS first. Only switch to LMS if you need to save the last few MIPS. NLMS has fewer tuning knobs and better robustness—most scenarios work on the first try.
Fixed-Point NLMS Implementation
When the embedded system lacks an FPU, implement NLMS with Q31 fixed-point arithmetic. Weights and signals are int32_t values, and the q31_mul macro handles saturated shifts:
| |
The fixed-point version uses q31_mul for energy accumulation to avoid 32-bit overflow. MU_Q31 and EPS_Q31 are in Q31 format, equivalent to approximately 0.02 and 0.00001 in floating point.