Embedded ANC: STM32 in Practice
Hardware Architecture
Implementing real-time ANC begins with selecting the right hardware platform. The controller must finish each adaptive-filter update within its sample or block deadline (one sample period at 16 kHz is 62.5 μs) while managing multiple audio data streams.
| Module | Function | Typical Choice |
|---|---|---|
| Main Controller | Executes adaptive algorithm | STM32F4/F7, ESP32-S3, nRF5340 |
| Reference Mic | Captures ambient noise | Knowles SPH0645, Infineon IM69D130 |
| Error Mic | Captures residual noise | Knowles SPH0645, TDK ICS-43434 |
| Audio DAC | Outputs anti-noise signal | ES9218, PCM5102 |
| Amplifier | Drives speaker | Class-D |
The reference microphone sits outside the earcup and captures external ambient noise as the algorithm’s reference input. The error microphone sits inside the earcup near the speaker, capturing residual noise for evaluating cancellation performance and driving adaptive updates. The audio DAC converts the digital anti-noise signal to analog, which is amplified by a Class-D amplifier to drive the speaker.
MEMS Microphone Interface
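Modern ANC products widely use digital MEMS microphones with integrated ADCs and PDM (pulse density modulation) interfaces. PDM is a 1-bit oversampled bitstream, typically clocked at 1.024–3.072 MHz, which corresponds to 64× or 128× the final audio sample rate. Because each PDM microphone drives the data line on only one clock edge, two microphones can share a single data line, one per edge.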
STM32 devices that include the DFSDM (digital filter for sigma-delta modulator) peripheral connect directly to PDM microphones. DFSDM has a built-in Sinc filter that converts the PDM stream to 16-bit or 24-bit PCM data, eliminating the need for external codecs.
DFSDM Configuration
The following code configures the STM32 DFSDM channel and filter to capture audio from a digital MEMS microphone:
| |
Key parameter descriptions:
- OutputClock.Divider = 32: divides the DFSDM source clock by 32 to produce the PDM master clock, which must fall within the microphone's supported range (typically 1–3 MHz)
- SincOrder = DFSDM_FILTER_SINC3_ORDER: third-order Sinc filter, chosen for stronger stopband attenuation than lower orders
- Oversampling = 64: decimation ratio; the output sample rate equals the PDM clock divided by this value
flowchart TD
PDM["PDM Bitstream<br/>1-bit @ 1.024MHz"] --> SINC["Sinc³ Digital Filter<br/>Decimation"]
SINC --> DEC["Decimation<br/>64x → 16kHz"]
DEC --> PCM["16-bit PCM<br/>@ 16kHz"]
classDef raw fill:#f44336,color:#fff
classDef filter fill:#2196F3,color:#fff
classDef out fill:#4CAF50,color:#fff
class PDM raw
class SINC,DEC filter
class PCM out
The core processing chain inside DFSDM is shown above. The PDM bitstream passes through a Sinc³ filter for decimation and downsampling, finally producing 16-bit PCM data.
Sinc³ (third-order Sinc) is chosen over FastSinc for its balance between latency and stopband attenuation. Its transfer function is H(z) = ((1 - z⁻ᴿ) / (1 - z⁻¹))³, where R is the decimation ratio. Compared with lower-order Sinc filters at the same oversampling ratio, it gives a steeper roll-off (the sidelobe envelope falls at roughly 60 dB/decade) and better stopband attenuation, which suits audio applications. FastSinc has lower computational cost and group delay, but its weaker stopband attenuation typically limits it to watchdog or threshold detection; it is not suitable for ANC’s main audio path.
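To make the Sinc³ transfer function concrete, here is a minimal software model of a third-order Sinc (CIC) decimator in portable C. It is a sketch for experimenting on a host, not a description of the DFSDM hardware; the names `sinc3_t`, `sinc3_init`, `sinc3_push`, and `pdm_output_rate_hz` are made up for this illustration.

```c
#include <stdint.h>

/* Third-order Sinc (CIC) decimator state. All names are illustrative. */
typedef struct {
    uint32_t integ[3];  /* integrators, updated at the PDM bit rate */
    uint32_t comb[3];   /* comb delays, updated at the output rate */
    uint32_t ratio;     /* decimation ratio R */
    uint32_t count;     /* bits received since the last output */
} sinc3_t;

void sinc3_init(sinc3_t *f, uint32_t ratio)
{
    for (int i = 0; i < 3; i++) {
        f->integ[i] = 0;
        f->comb[i] = 0;
    }
    f->ratio = ratio;
    f->count = 0;
}

/* Feed one PDM bit; returns 1 and stores a PCM sample in *out every R bits.
 * Unsigned arithmetic keeps integrator wrap-around well defined. */
int sinc3_push(sinc3_t *f, int bit, int32_t *out)
{
    uint32_t v = bit ? 1u : (uint32_t)-1;   /* map bit to +1 / -1 */
    f->integ[0] += v;
    f->integ[1] += f->integ[0];
    f->integ[2] += f->integ[1];
    if (++f->count < f->ratio)
        return 0;
    f->count = 0;
    v = f->integ[2];
    for (int i = 0; i < 3; i++) {           /* three comb (difference) stages */
        uint32_t prev = f->comb[i];
        f->comb[i] = v;
        v -= prev;
    }
    *out = (int32_t)v;                      /* full scale is +/- R^3 */
    return 1;
}

/* Output sample rate for a given PDM clock and decimation ratio. */
uint32_t pdm_output_rate_hz(uint32_t pdm_clock_hz, uint32_t ratio)
{
    return pdm_clock_hz / ratio;
}
```

With R = 64, a constant stream of 1 bits settles at the DC gain R³ = 262144 after three output samples, and a 1.024 MHz PDM clock yields a 16 kHz output rate. A real design would scale or shift the result down to 16 or 24 bits.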
A dual-microphone configuration (reference + error) requires two DFSDM channels, each bound to a different microphone data line.
CMSIS-DSP FXLMS Implementation
CMSIS-DSP is ARM’s official digital signal processing library, with SIMD optimizations for Cortex-M cores. When implementing FXLMS on STM32, use arm_fir_f32 directly for filtering rather than writing manual loops.
arm_fir_f32 targets the single-precision FPU of Cortex-M4F/M7 cores. Each VFMA instruction performs one fused floating-point multiply-accumulate; the packed SIMD instructions of the DSP extension operate on 8-bit and 16-bit integers, so they benefit the Q7/Q15 variants rather than the float path. CMSIS-DSP gets its speed from loop unrolling and from reusing loaded samples and coefficients across several outputs, which keeps the FPU pipeline busy and, on the dual-issue Cortex-M7, lets loads overlap with arithmetic. Handwritten assembly rarely outperforms it.
Data Structure Definitions
| |
w_filter is the adaptive controller; its coefficients are continuously updated at runtime. s_filter is the estimate of the secondary path S(z); its coefficients are determined through offline identification and do not change during operation.
Initialization and Processing
| |
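The weight update above uses the Normalized LMS (NLMS) strategy. The standard FXLMS weight recursion formula is:
w(n+1) = w(n) + μ · e(n) · x′(n)
where x′(n) is the filtered reference vector, e(n) is the error signal, and μ is a fixed step size. The plus sign assumes the error is defined as e(n) = d(n) − y′(n); if the error microphone signal is taken as the acoustic sum d(n) + y′(n), the update term is subtracted instead. NLMS introduces a normalization factor, adjusting the step size to:
μ(n) = μ / (‖x′(n)‖² + ε)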
This automatically reduces the step size when the reference signal power is large (preventing divergence) and increases it when the signal power is small (accelerating convergence). The small constant ε (set to 1e-6 in the code) prevents division by zero.
Execution flow of fxlms_cmsis_process:
- Reference signal x_ref passes through W(z) to generate the anti-noise output y_output
- Reference signal passes through the S(z) estimate to produce the filtered reference x_filtered
- The filtered reference updates the W(z) coefficients with normalized step size
The weight update uses x_filtered rather than the raw x_ref — this is the core distinction between FXLMS and standard LMS, compensating for the phase shift introduced by the secondary path.
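The three steps can be sketched in plain C without CMSIS-DSP so they run on a host. This is an illustrative sample-by-sample version, not the block-based fxlms_cmsis_process routine; `fxlms_t`, `fxlms_output`, `fxlms_update`, `fxlms_demo_residual`, the tap counts, and the toy acoustic paths are all assumptions made for this example. It uses the error convention e = d − y′.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define W_TAPS 8   /* controller length, illustrative */
#define S_TAPS 2   /* secondary-path model length, must be <= W_TAPS */

typedef struct {
    float w[W_TAPS];        /* adaptive controller W(z) */
    float s_hat[S_TAPS];    /* fixed estimate of S(z) */
    float x_hist[W_TAPS];   /* reference history, newest first */
    float xf_hist[W_TAPS];  /* filtered-reference history, newest first */
    float mu;               /* normalized step size */
} fxlms_t;

static void push_front(float *hist, float v)
{
    memmove(&hist[1], &hist[0], (W_TAPS - 1) * sizeof(float));
    hist[0] = v;
}

/* Steps 1 and 2: anti-noise output and filtered reference for one sample. */
float fxlms_output(fxlms_t *f, float x_ref)
{
    float y = 0.0f, xf = 0.0f;
    push_front(f->x_hist, x_ref);
    for (size_t k = 0; k < W_TAPS; k++)
        y += f->w[k] * f->x_hist[k];
    for (size_t k = 0; k < S_TAPS; k++)
        xf += f->s_hat[k] * f->x_hist[k];
    push_front(f->xf_hist, xf);
    return y;
}

/* Step 3: NLMS update driven by the filtered reference, with e = d - y'. */
void fxlms_update(fxlms_t *f, float err)
{
    float power = 1e-6f;    /* epsilon guards against division by zero */
    for (size_t k = 0; k < W_TAPS; k++)
        power += f->xf_hist[k] * f->xf_hist[k];
    float gain = f->mu * err / power;
    for (size_t k = 0; k < W_TAPS; k++)
        f->w[k] += gain * f->xf_hist[k];
}

/* Toy simulation: primary path 0.8 z^-2, secondary path 0.5 z^-1 (modelled
 * exactly). Returns late error energy divided by early error energy. */
float fxlms_demo_residual(void)
{
    fxlms_t f;
    memset(&f, 0, sizeof f);
    f.s_hat[1] = 0.5f;
    f.mu = 0.1f;
    uint32_t lcg = 1u;
    float x1 = 0.0f, x2 = 0.0f, y1 = 0.0f, early = 0.0f, late = 0.0f;
    for (int n = 0; n < 4000; n++) {
        lcg = lcg * 1664525u + 1013904223u;          /* pseudo-random noise */
        float x = (float)((lcg >> 16) & 0xFFFFu) / 32768.0f - 1.0f;
        float y = fxlms_output(&f, x);
        float e = 0.8f * x2 - 0.5f * y1;             /* error microphone */
        fxlms_update(&f, e);
        if (n < 200) early += e * e;
        if (n >= 3800) late += e * e;
        y1 = y; x2 = x1; x1 = x;
    }
    return late / early;
}
```

In this toy setup the ideal controller is a single tap of 1.6 at delay 1, so the residual error energy should fall far below its starting level. A production version would replace the inner loops with arm_fir_f32 calls and process whole blocks.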
FreeRTOS Real-Time Task Framework
In production products, ANC processing runs as a real-time task triggered by a timer or DMA interrupt.
| |
flowchart TD
IRQ["DMA Half-Transfer Complete IRQ"] --> NOTIFY["ulTaskNotifyTake<br/>Wake ANC Task"]
NOTIFY --> PROC["Process First Half<br/>64 samples"]
PROC --> DAC["DMA Start DAC Output"]
IRQ2["DMA Full-Transfer Complete IRQ"] --> NOTIFY2["Wake ANC Task"]
NOTIFY2 --> PROC2["Process Second Half<br/>64 samples"]
PROC2 --> DAC2["DMA Start DAC Output"]
DAC2 --> IRQ
classDef irq fill:#f44336,color:#fff
classDef task fill:#2196F3,color:#fff
classDef out fill:#4CAF50,color:#fff
class IRQ,IRQ2 irq
class NOTIFY,PROC,NOTIFY2,PROC2 task
class DAC,DAC2 out
The DMA half-transfer and full-transfer complete interrupts fire alternately, driving the ANC task to process data blocks in a pipeline fashion.
Task flow:
- Wait for notification (ulTaskNotifyTake): the DMA sends a notification after completing a data acquisition
- Type conversion: Q15 to float (arm_q15_to_float); CMSIS-DSP provides batch conversion functions
- Execute FXLMS: call fxlms_cmsis_process
- Convert result back to Q15 (arm_float_to_q15), start DAC DMA transfer
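The two conversion steps in the task flow can be illustrated with portable stand-ins. The functions below are hypothetical helpers written for this example (`q15_to_float_block`, `float_to_q15_block`); they mimic the scaling and saturation behaviour expected of arm_q15_to_float and arm_float_to_q15 but are not the CMSIS-DSP implementations, and rounding details may differ.

```c
#include <stddef.h>
#include <stdint.h>

/* Portable stand-in for the Q15-to-float step: scale by 1/32768. */
void q15_to_float_block(const int16_t *src, float *dst, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = (float)src[i] / 32768.0f;
}

/* Portable stand-in for the float-to-Q15 step: scale by 32768 and saturate. */
void float_to_q15_block(const float *src, int16_t *dst, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        float v = src[i] * 32768.0f;
        if (v > 32767.0f)  v = 32767.0f;    /* clip positive overflow */
        if (v < -32768.0f) v = -32768.0f;   /* clip negative overflow */
        dst[i] = (int16_t)v;                /* truncates toward zero */
    }
}
```

Saturation matters on the way out: an anti-noise sample slightly above full scale must clip to 32767 rather than wrap to a large negative value, which would produce a loud click at the speaker.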
Real-Time Optimization
ANC is extremely latency-sensitive. The total delay from error microphone capture to speaker output must be kept within 0.5 ms; otherwise the phase shift grows large enough to cause algorithm failure or even positive feedback oscillation.
| Optimization | Description | Typical Effect |
|---|---|---|
| Lower sample rate | 16 kHz instead of 48 kHz | 3× reduction in computation |
| Smaller block size | 16–32 samples | Latency reduced to 1–2 ms |
| Fewer filter taps | 32–64 taps | Thousands of MACs saved per block |
| Fixed-point arithmetic | Q15/Q31 format | Eliminates floating-point conversion overhead |
| DMA double buffering | Ping-pong buffer | Eliminates data transfer wait time |
| Core affinity | Bind ANC to dedicated core | Avoids task switching jitter |
Latency Calculation
Understanding latency requires building intuition: at 16 kHz sample rate, one sample period = 1/16000 = 62.5 μs. Block size 16 → block latency = 16 × 62.5 μs = 1 ms. The total end-to-end latency from noise entering the microphone to anti-noise leaving the speaker consists of three components:
| Latency Component | Typical Value | Description |
|---|---|---|
| Block latency | 1–4 ms | Time to fill a DMA buffer, depends on sample rate and block size |
| Compute latency | 0.1–0.5 ms | FXLMS filtering + weight update time |
| DAC latency | 0.1–0.3 ms | DAC conversion + amplifier response |
| Total latency | 1.2–4.8 ms | Must be <0.5 ms for effective cancellation → extreme optimization needed |
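The 0.5 ms target is an empirical threshold for active noise cancellation. A pure delay τ adds a phase lag of 360° × f × τ, so a 0.5 ms delay reaches 180° at 1 kHz; above that frequency the anti-noise reinforces the noise instead of cancelling it (positive feedback oscillation). A 16-sample block at 16 kHz already costs 1 ms on its own, so meeting the target means shrinking the block to a few samples (4 samples take 0.25 ms) or processing sample by sample, combined with short filters (around 32 taps) and fixed-point arithmetic to keep compute time low. Together these bring total latency toward the 0.3–0.4 ms range.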
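The latency arithmetic is easy to check with a few helpers. This is a small sketch using integer microseconds; the function names are invented for this example, and the phase formula assumes a pure delay (360° × f × τ).

```c
#include <stdint.h>

/* Time to fill one block, in microseconds. */
uint32_t block_latency_us(uint32_t sample_rate_hz, uint32_t block_size)
{
    return (uint32_t)(((uint64_t)block_size * 1000000u) / sample_rate_hz);
}

/* Phase lag in degrees of a pure delay at a given frequency: 360 * f * tau. */
uint32_t delay_phase_deg(uint32_t delay_us, uint32_t freq_hz)
{
    return (uint32_t)((360ull * freq_hz * delay_us) / 1000000u);
}

/* Largest block that fits a latency budget after fixed costs
 * (compute time plus DAC delay) are subtracted. */
uint32_t max_block_size(uint32_t sample_rate_hz, uint32_t budget_us,
                        uint32_t fixed_us)
{
    if (fixed_us >= budget_us)
        return 0;
    return (uint32_t)(((uint64_t)(budget_us - fixed_us) * sample_rate_hz)
                      / 1000000u);
}
```

For example, `block_latency_us(16000, 16)` gives 1000 μs, `delay_phase_deg(500, 1000)` gives 180°, and with a 500 μs budget and 200 μs of fixed cost, `max_block_size(16000, 500, 200)` allows only 4 samples per block.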
Fixed-Point Optimization
Although Cortex-M4F/M7 cores have a hardware FPU for floating-point FXLMS, a fixed-point version can further reduce latency. CMSIS-DSP provides the arm_fir_q15 and arm_fir_q31 function families. With coefficients and signals in Q15 format, the DSP extension's dual 16-bit multiply-accumulate instructions (such as SMLAD) process two filter taps per instruction.
Float vs Fixed-Point Tradeoffs: LMS weight updates involve numerous multiply-accumulate operations. In Q15 format, the product of two 16-bit values is a Q30 number held in a 32-bit word; it must be shifted right by 15 bits and saturated to return to the Q15 range, because an unhandled overflow wraps around and causes drastic coefficient jumps. Floating-point numbers (float32) are processed directly by the hardware FPU; IEEE 754’s exponent field handles the dynamic range automatically, which makes overflow a non-issue at audio signal levels and keeps the code more intuitive.
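The Q15 handling described above can be sketched as follows. The helper names (`sat_q15`, `q15_mul`, `q15_lms_update`) are hypothetical; the code assumes the compiler implements right shift of negative integers as an arithmetic shift, which is the case for the usual ARM toolchains.

```c
#include <stdint.h>

/* Saturate a 32-bit value to the Q15 range. */
int16_t sat_q15(int32_t v)
{
    if (v > 32767)  return 32767;
    if (v < -32768) return -32768;
    return (int16_t)v;
}

/* Q15 x Q15 gives a Q30 product; shift back to Q15, then saturate. */
int16_t q15_mul(int16_t a, int16_t b)
{
    int32_t prod = (int32_t)a * (int32_t)b;   /* Q30 in a 32-bit word */
    return sat_q15(prod >> 15);               /* assumes arithmetic shift */
}

/* One Q15 LMS coefficient update: w += mu * e * x, all operands in Q15. */
int16_t q15_lms_update(int16_t w, int16_t mu, int16_t e, int16_t x)
{
    int16_t step = q15_mul(q15_mul(mu, e), x);
    return sat_q15((int32_t)w + step);
}
```

The only product that overflows is (−1) × (−1): `q15_mul(-32768, -32768)` would be +1.0, which Q15 cannot represent, so it saturates to 32767. Without the saturation in `q15_lms_update`, a coefficient near full scale would wrap to a negative value after one large update.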
| Dimension | float32 | Q15/Q31 |
|---|---|---|
| Dynamic range | ±3.4×10³⁸ | [−1, 1) for both; resolution 2⁻¹⁵ (Q15) / 2⁻³¹ (Q31) |
| Overflow risk | Negligible in practice | Saturation required |
| MACs per instruction | 1 (VFMA) | 2 (SMLAD, Q15 only) |
| Power consumption | Higher | Lower |
| Code readability | High | Low |
In summary, floating-point suits prototyping and M7-series parts with ample performance headroom; fixed-point is appropriate for power-constrained or cost-sensitive scenarios, though debugging effort increases significantly. Many products validate on a floating-point M7 prototype, then port to fixed-point M4 for cost optimization.
DMA Double Buffering
Double buffering (ping-pong mode) is a classic DMA technique. Two buffers alternate roles: while the DMA controller writes new data to buffer A, the CPU processes the already-ready data in buffer B; when DMA completes writing to A, it immediately switches to B, while the CPU starts processing A. This alternating mechanism eliminates data transfer wait time, allowing CPU and DMA to work fully in parallel.
At a 16 kHz sample rate with a block size of 64, DMA fills one buffer in approximately 4 ms. With single buffering, the CPU must wait for DMA to finish before starting processing, wasting half the available time. Double buffering nearly doubles effective computational throughput.
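The ping-pong mechanism can be modelled on a host with a simulated DMA writer. This is a sketch under stated assumptions: `pingpong_t` and its functions are invented for this example, the DMA is replaced by a plain function call, and on real hardware the `ready` flag would be set from the half-transfer and full-transfer interrupt handlers.

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK 64                       /* samples per half buffer */

typedef struct {
    int16_t buf[2 * BLOCK];            /* one circular buffer, two halves */
    size_t  pos;                       /* simulated DMA write position */
    int     ready;                     /* -1: none, 0: first half, 1: second */
} pingpong_t;

void pingpong_init(pingpong_t *p)
{
    p->pos = 0;
    p->ready = -1;
}

/* Simulated DMA: store one sample and raise the half/full-transfer flag. */
void pingpong_dma_write(pingpong_t *p, int16_t sample)
{
    p->buf[p->pos++] = sample;
    if (p->pos == BLOCK) {
        p->ready = 0;                  /* half-transfer complete */
    } else if (p->pos == 2 * BLOCK) {
        p->ready = 1;                  /* full-transfer complete */
        p->pos = 0;                    /* circular mode wraps to the start */
    }
}

/* CPU side: returns the half that is safe to process, or NULL if none. */
const int16_t *pingpong_take(pingpong_t *p)
{
    if (p->ready < 0)
        return NULL;
    const int16_t *half = &p->buf[p->ready * BLOCK];
    p->ready = -1;
    return half;
}
```

While the CPU works on the half returned by `pingpong_take`, the writer is already filling the other half. The scheme only holds if processing one block finishes before the next half completes (4 ms at 16 kHz with 64-sample halves); otherwise the DMA overwrites data that is still being read.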
| |
Heterogeneous Architecture (ESP32 + STM32)
In complex ANC systems, a single MCU struggles to simultaneously meet the demands of real-time audio processing and non-real-time system management. A common approach is a heterogeneous architecture: ESP32 handles wireless connectivity, user interaction, and system management, while STM32 focuses on real-time audio processing.
The two chips communicate over UART, SPI, or I2C. The data passed includes audio streams and control commands.
Custom Communication Frame Format
| |
The header field provides frame synchronization — the receiver starts parsing upon detecting 0xAA 0x55. type distinguishes between audio sample streams and parameter configuration commands. seq enables packet loss detection and reordering.
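Header synchronization and loss detection can be sketched in a few lines of C. Only the 0xAA 0x55 header and the type and seq fields come from the description above; the field widths (one byte each), the struct name `frame_prefix_t`, and both helper functions are assumptions made for this illustration, and any length, payload, or checksum fields are left out.

```c
#include <stddef.h>
#include <stdint.h>

#define FRAME_SYNC0 0xAAu
#define FRAME_SYNC1 0x55u

/* Hypothetical layout of the start of a frame. */
typedef struct {
    uint8_t header[2];   /* 0xAA 0x55 */
    uint8_t type;        /* audio stream or parameter command */
    uint8_t seq;         /* assumed 8-bit counter, wraps at 256 */
} frame_prefix_t;

/* Return the offset of the first 0xAA 0x55 pair, or -1 if none is found. */
long frame_find_sync(const uint8_t *buf, size_t len)
{
    for (size_t i = 0; i + 1 < len; i++) {
        if (buf[i] == FRAME_SYNC0 && buf[i + 1] == FRAME_SYNC1)
            return (long)i;
    }
    return -1;
}

/* Number of frames missing between two consecutively received sequence
 * numbers; modulo-256 arithmetic handles the counter wrap. */
uint8_t frame_lost_count(uint8_t prev_seq, uint8_t cur_seq)
{
    return (uint8_t)(cur_seq - prev_seq - 1u);
}
```

For instance, receiving seq 1 directly after seq 254 means frames 255 and 0 were lost, so `frame_lost_count(254, 1)` returns 2. A two-byte header can also appear inside payload data, so a robust receiver would confirm a candidate frame with a length check or checksum before acting on it.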
The ESP32 handles the Wi-Fi/Bluetooth protocol stack, user interface, and parameter updates. When a user adjusts the ANC mode through a mobile app, the ESP32 packages the new filter coefficients or step-size parameters into a control frame and sends it to the STM32. The STM32 hot-updates the algorithm parameters on receipt without pausing audio processing.
Mermaid: STM32 ANC Data Flow
flowchart TD
subgraph Input
REF["Reference Mic<br/>PDM"]
ERR["Error Mic<br/>PDM"]
end
subgraph STM32
DFSDM["DFSDM<br/>PDM→PCM"]
DMA["DMA<br/>Ping-Pong"]
FXLMS["FXLMS<br/>CMSIS-DSP"]
DAC["DAC<br/>PCM→Analog"]
TASK["FreeRTOS<br/>ANC Task"]
end
subgraph Output
SPK["Speaker<br/>Anti-noise"]
end
subgraph ESP
WIFI["Wi-Fi/BT<br/>Parameter Push"]
end
REF --> DFSDM
ERR --> DFSDM
DFSDM --> DMA
DMA --> TASK
TASK --> FXLMS
FXLMS --> DAC
DAC --> SPK
WIFI -->|UART/SPI| TASK