Embedded ANC: STM32 in Practice

Hardware Architecture

Implementing real-time ANC begins with selecting the right hardware platform. The controller must complete adaptive filtering within microseconds while managing multiple audio data streams.

| Module | Function | Typical Choice |
| --- | --- | --- |
| Main Controller | Executes adaptive algorithm | STM32F4/F7, ESP32-S3, nRF5340 |
| Reference Mic | Captures ambient noise | Knowles SPH0645, Infineon IM69D130 |
| Error Mic | Captures residual noise | Knowles SPH0645, TDK ICS-43434 |
| Audio DAC | Outputs anti-noise signal | ES9218, PCM5102 |
| Amplifier | Drives speaker | Class-D |

The reference microphone sits outside the earcup and captures external ambient noise as the algorithm’s reference input. The error microphone sits inside the earcup near the speaker, capturing residual noise for evaluating cancellation performance and driving adaptive updates. The audio DAC converts the digital anti-noise signal to analog, which is amplified by a Class-D amplifier to drive the speaker.

MEMS Microphone Interface

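Modern ANC products widely use digital MEMS microphones with integrated ADCs and PDM (pulse density modulation) interfaces. A PDM microphone emits a 1-bit oversampled stream clocked at roughly 1.024–3.072 MHz, typically 64× or 128× the target audio sample rate. Because each microphone drives the data line on only one clock edge, two microphones can share a single data line, one on the rising edge and one on the falling edge.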

STM32 connects directly to PDM microphones through the DFSDM (digital filter for sigma-delta modulator) peripheral. DFSDM has a built-in Sinc filter that converts the PDM stream to 16-bit or 24-bit PCM data, eliminating the need for external codecs.

DFSDM Configuration

The following code configures the STM32 DFSDM channel and filter to capture audio from a digital MEMS microphone:

c
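void DFSDM_Init(void) {
    /* Channel 0: clocks the PDM microphone and receives its bitstream */
    hdfsdm1_Channel0.Instance = DFSDM1_Channel0;
    hdfsdm1_Channel0.Init.OutputClock.Activation = ENABLE;   /* drive CKOUT */
    hdfsdm1_Channel0.Init.OutputClock.Selection = DFSDM_CHANNEL_OUTPUT_CLOCK_AUDIO;
    hdfsdm1_Channel0.Init.OutputClock.Divider = 32;
    hdfsdm1_Channel0.Init.Input.Multiplexer = DFSDM_CHANNEL_EXTERNAL_INPUTS;
    hdfsdm1_Channel0.Init.Input.DataPacking = DFSDM_CHANNEL_STANDARD_MODE;
    hdfsdm1_Channel0.Init.Input.Pins = DFSDM_CHANNEL_SAME_CHANNEL_PINS;
    hdfsdm1_Channel0.Init.SerialInterface.Type = DFSDM_CHANNEL_SPI_RISING;
    hdfsdm1_Channel0.Init.SerialInterface.SpiClock = DFSDM_CHANNEL_SPI_CLOCK_INTERNAL;
    /* Analog watchdog is unused here; the HAL only accepts 1..32 for its oversampling */
    hdfsdm1_Channel0.Init.Awd.FilterOrder = DFSDM_CHANNEL_FASTSINC_ORDER;
    hdfsdm1_Channel0.Init.Awd.Oversampling = 1;
    hdfsdm1_Channel0.Init.Offset = 0;
    hdfsdm1_Channel0.Init.RightBitShift = 0;
    if (HAL_DFSDM_ChannelInit(&hdfsdm1_Channel0) != HAL_OK) {
        Error_Handler();   /* error hook as generated by STM32CubeMX */
    }

    /* Filter 0: Sinc3 decimation of the bitstream into PCM samples */
    hdfsdm1_Filter0.Instance = DFSDM1_Filter0;
    hdfsdm1_Filter0.Init.RegularParam.Trigger = DFSDM_FILTER_SW_TRIGGER;
    hdfsdm1_Filter0.Init.RegularParam.FastMode = ENABLE;
    hdfsdm1_Filter0.Init.RegularParam.DmaMode = ENABLE;      /* samples are fetched by DMA */
    hdfsdm1_Filter0.Init.FilterParam.SincOrder = DFSDM_FILTER_SINC3_ORDER;
    hdfsdm1_Filter0.Init.FilterParam.Oversampling = 64;
    hdfsdm1_Filter0.Init.FilterParam.IntOversampling = 1;
    if (HAL_DFSDM_FilterInit(&hdfsdm1_Filter0) != HAL_OK) {
        Error_Handler();
    }

    /* Route channel 0 to the filter's regular conversions, running continuously */
    HAL_DFSDM_FilterConfigRegChannel(&hdfsdm1_Filter0,
                                     DFSDM_CHANNEL_0,
                                     DFSDM_CONTINUOUS_CONV_ON);
}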

Key parameter descriptions:

  • OutputClock.Divider = 32: Divides the DFSDM audio clock down to the PDM clock driven on CKOUT; a 32.768 MHz audio clock, for example, would give 1.024 MHz
  • SincOrder = DFSDM_FILTER_SINC3_ORDER: Third-order Sinc filter, which attenuates the shaped quantization noise more strongly than lower orders
  • Oversampling = 64: Decimation ratio; output sample rate = PDM clock / Oversampling (1.024 MHz / 64 = 16 kHz)

mermaid
flowchart TD
    PDM["PDM Bitstream<br/>1-bit @ 1.024MHz"] --> SINC["Sinc³ Digital Filter<br/>Decimation"]
    SINC --> DEC["Decimation<br/>64x → 16kHz"]
    DEC --> PCM["16-bit PCM<br/>@ 16kHz"]

    classDef raw fill:#f44336,color:#fff
    classDef filter fill:#2196F3,color:#fff
    classDef out fill:#4CAF50,color:#fff
    class PDM raw
    class SINC,DEC filter
    class PCM out

The core processing chain inside DFSDM is shown above. The PDM bitstream is low-pass filtered and decimated by the Sinc³ filter, finally producing 16-bit PCM data.

Sinc³ (third-order Sinc) is chosen over FastSinc for its balance between latency and stopband attenuation. Its transfer function is H(z) = ((1 - z⁻ᴿ) / (1 - z⁻¹))³, where R is the decimation (oversampling) ratio. Compared with lower-order filters at the same R, it offers a steeper roll-off (the sidelobe envelope falls at roughly 60 dB per decade) and deeper nulls around multiples of the output sample rate, which suits audio applications. FastSinc has lower computational cost and group delay, but its weaker stopband attenuation typically limits it to watchdog or threshold detection; it is not suitable for ANC’s main audio path.
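
To make the transfer function concrete, the sketch below models a third-order Sinc (CIC) decimator in portable C: three integrators running at the PDM bit rate, decimation by R, then three comb stages. It is an illustrative software model only, not a description of how the DFSDM hardware is built, and the names Sinc3Decimator, sinc3_init and sinc3_push are invented for this example.

```c
#include <stdint.h>

/* Software model of H(z) = ((1 - z^-R) / (1 - z^-1))^3 with decimation by R. */
typedef struct {
    uint32_t integ[3];  /* integrator states, updated at the PDM bit rate */
    uint32_t comb[3];   /* comb delay elements, updated at the output rate */
    uint32_t phase;     /* input bits seen in the current decimation period */
    uint32_t ratio;     /* decimation ratio R */
} Sinc3Decimator;

void sinc3_init(Sinc3Decimator *d, uint32_t ratio)
{
    for (int i = 0; i < 3; i++) {
        d->integ[i] = 0;
        d->comb[i] = 0;
    }
    d->phase = 0;
    d->ratio = ratio;
}

/* Feed one PDM bit (0 or 1). Returns 1 and writes *out when a PCM sample is
 * ready, otherwise returns 0. Unsigned arithmetic lets the integrators wrap
 * harmlessly as long as 3*log2(R) + 1 bits fit in 32 bits. */
int sinc3_push(Sinc3Decimator *d, int bit, int32_t *out)
{
    uint32_t v = bit ? 1u : (uint32_t)-1;   /* map the bit to +1 / -1 */

    d->integ[0] += v;
    d->integ[1] += d->integ[0];
    d->integ[2] += d->integ[1];

    if (++d->phase < d->ratio) {
        return 0;                           /* still inside the decimation period */
    }
    d->phase = 0;

    uint32_t y = d->integ[2];
    for (int i = 0; i < 3; i++) {           /* three first-order differences */
        uint32_t prev = d->comb[i];
        d->comb[i] = y;
        y -= prev;
    }
    *out = (int32_t)y;
    return 1;
}
```

With R = 64, a constant stream of 1 bits settles to 64³ = 262144 from the third output sample onward, so full scale needs 19 bits including the sign. At a 1.024 MHz PDM clock the same R produces one output sample every 62.5 μs, i.e. 16 kHz.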

A dual-microphone configuration (reference + error) requires two DFSDM channels, each bound to a different microphone data line.

CMSIS-DSP FXLMS Implementation

CMSIS-DSP is ARM’s official digital signal processing library, with SIMD optimizations for Cortex-M cores. When implementing FXLMS on STM32, use arm_fir_f32 directly for filtering rather than writing manual loops.

arm_fir_f32 is tuned for Cortex-M4F/M7 cores. The FPU on these cores is scalar: a single VFMA instruction performs one single-precision multiply-accumulate, and the SIMD instructions of the DSP extension operate on packed 8- and 16-bit integers rather than floats. The speed of arm_fir_f32 therefore comes from loop unrolling, computing several output samples per pass, and keeping coefficients and state in registers, which on the dual-issue Cortex-M7 lets loads overlap with arithmetic. These optimizations are mature enough that handwritten assembly rarely outperforms the library.

Data Structure Definitions

c
#include "arm_math.h"
#include "cmsis_os.h"

#define BLOCK_SIZE      64
#define FILTER_TAPS     64
#define NUM_TAPS_STAGE  32

typedef struct {
    arm_fir_instance_f32 fir_inst;
    float32_t state[FILTER_TAPS + BLOCK_SIZE - 1];
    float32_t coeffs[FILTER_TAPS];
} CMSIS_FIR_Filter;

typedef struct {
    CMSIS_FIR_Filter w_filter;
    CMSIS_FIR_Filter s_filter;
    float32_t mu;
    float32_t x_buf[BLOCK_SIZE];
    float32_t e_buf[BLOCK_SIZE];
} FXLMS_CMSIS;

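w_filter is the adaptive controller; its coefficients are continuously updated at runtime. s_filter is the estimate of the secondary path S(z); its coefficients are determined through offline identification and do not change during operation. Note that the CMSIS-DSP FIR functions expect the coefficient array in time-reversed order (last tap first), so coeffs[FILTER_TAPS - 1 - k] holds tap k.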

Initialization and Processing

c
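#include <string.h>   /* memset, memcpy */

/* fx->x_buf carries the last FILTER_TAPS-1 filtered-reference samples across blocks */
#if BLOCK_SIZE < (FILTER_TAPS - 1)
#error "x_buf is too small to hold the filtered-reference history"
#endif

/* s_hat: secondary-path estimate in CMSIS-DSP (time-reversed) order, s_len <= FILTER_TAPS.
   The array must outlive the filter because CMSIS-DSP stores only the pointer. */
void fxlms_cmsis_init(FXLMS_CMSIS *fx, float32_t *s_hat, uint32_t s_len) {
    memset(fx->w_filter.coeffs, 0, sizeof(fx->w_filter.coeffs));
    memset(fx->x_buf, 0, sizeof(fx->x_buf));
    arm_fir_init_f32(&fx->w_filter.fir_inst, FILTER_TAPS,
                     fx->w_filter.coeffs, fx->w_filter.state, BLOCK_SIZE);
    arm_fir_init_f32(&fx->s_filter.fir_inst, (uint16_t)s_len,
                     s_hat, fx->s_filter.state, BLOCK_SIZE);
    fx->mu = 0.0005f;
}

void fxlms_cmsis_process(FXLMS_CMSIS *fx, float32_t *x_ref,
                         float32_t *e_mic, float32_t *output,
                         uint32_t block_size) {
    /* Layout: [FILTER_TAPS-1 samples of history | current block] of x'(n) */
    float32_t xf[FILTER_TAPS - 1 + BLOCK_SIZE];
    float32_t *w = fx->w_filter.coeffs;   /* w[FILTER_TAPS-1-k] holds tap k */

    if (block_size == 0 || block_size > BLOCK_SIZE) {
        return;
    }

    /* 1. Anti-noise output y = W(z) x, written straight to the caller's buffer */
    arm_fir_f32(&fx->w_filter.fir_inst, x_ref, output, block_size);

    /* 2. Filtered reference x' = S_hat(z) x, appended after the saved history */
    memcpy(xf, fx->x_buf, (FILTER_TAPS - 1) * sizeof(float32_t));
    arm_fir_f32(&fx->s_filter.fir_inst, x_ref, &xf[FILTER_TAPS - 1], block_size);

    /* 3. NLMS update: win[j] = x'(n - (FILTER_TAPS-1-j)), so it lines up with w[j] */
    for (uint32_t n = 0; n < block_size; n++) {
        float32_t *win = &xf[n];
        float32_t power;
        arm_power_f32(win, FILTER_TAPS, &power);            /* ||x'(n)||^2 */
        float32_t g = fx->mu * e_mic[n] / (power + 1e-6f);
        for (uint32_t j = 0; j < FILTER_TAPS; j++) {
            w[j] += g * win[j];
        }
    }

    /* Keep the newest FILTER_TAPS-1 filtered samples for the next block */
    memcpy(fx->x_buf, &xf[block_size], (FILTER_TAPS - 1) * sizeof(float32_t));
}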

The weight update above uses the Normalized LMS (NLMS) strategy. The standard FXLMS weight recursion formula is:

w_k(n+1) = w_k(n) + μ · e(n) · x'(n-k)

where μ is a fixed step size. NLMS introduces a normalization factor, adjusting the step size to:

μ(n) = μ₀ / (||x'(n)||² + ε)

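Here ||x'(n)||² is the energy of the filtered-reference samples currently inside the filter, that is, the last FILTER_TAPS values of x'. Normalizing by it automatically reduces the step size when the reference signal power is large (preventing divergence) and increases it when the signal power is small (accelerating convergence). The small constant ε (set to 1e-6 in the code) prevents division by zero.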

Execution flow of fxlms_cmsis_process:

  1. Reference signal x_ref passes through W(z) to generate the anti-noise output y_output
  2. Reference signal passes through the S(z) estimate to produce the filtered reference x_filtered
  3. The filtered reference updates the W(z) coefficients with normalized step size

The weight update uses x_filtered rather than the raw x_ref — this is the core distinction between FXLMS and standard LMS, compensating for the phase shift introduced by the secondary path.
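
The role of the filtered reference is easiest to see in a small host-side simulation. The sketch below runs normalized FXLMS sample by sample on synthetic data in plain C, without CMSIS-DSP. Everything in it is an assumption made for illustration: the toy primary path P(z) = 0.8 z⁻², the secondary path S(z) = 0.5 z⁻¹ (taken as known exactly), the error convention e(n) = d(n) − s(n)∗y(n) that matches the + sign in the recursion above, and the names SIM_TAPS and fxlms_sim.

```c
#include <stddef.h>
#include <stdint.h>

#define SIM_TAPS 8   /* adaptive filter length, chosen only for this example */

/* Simulates normalized FXLMS on pseudo-random noise.
 * Writes the final weights to w_out and returns the mean squared residual
 * over the last `tail` samples (tail must be in 1..n_samples). */
double fxlms_sim(size_t n_samples, size_t tail, double mu, double w_out[SIM_TAPS])
{
    double w[SIM_TAPS]  = {0};  /* w[k] multiplies x(n-k) */
    double x[SIM_TAPS]  = {0};  /* x[k]  = x(n-k), reference history */
    double xf[SIM_TAPS] = {0};  /* xf[k] = x'(n-k), filtered-reference history */
    double y_prev = 0.0;        /* y(n-1), as delayed by the secondary path */
    double acc = 0.0;
    uint32_t lcg = 12345u;      /* deterministic noise source (simple LCG) */

    for (size_t n = 0; n < n_samples; n++) {
        for (size_t k = SIM_TAPS - 1; k > 0; k--) {   /* shift both histories */
            x[k] = x[k - 1];
            xf[k] = xf[k - 1];
        }
        lcg = lcg * 1664525u + 1013904223u;
        x[0] = (double)(lcg >> 8) / 8388608.0 - 1.0;  /* uniform in [-1, 1) */
        xf[0] = 0.5 * x[1];                           /* x'(n) = S(z) x(n) */

        double y = 0.0;                               /* y(n) = W(z) x(n) */
        for (size_t k = 0; k < SIM_TAPS; k++) {
            y += w[k] * x[k];
        }

        double e = 0.8 * x[2] - 0.5 * y_prev;         /* residual at the error mic */
        y_prev = y;

        double power = 1e-6;                          /* ||x'(n)||^2 + eps */
        for (size_t k = 0; k < SIM_TAPS; k++) {
            power += xf[k] * xf[k];
        }
        double g = mu * e / power;
        for (size_t k = 0; k < SIM_TAPS; k++) {
            w[k] += g * xf[k];                        /* update uses x', not x */
        }

        if (n + tail >= n_samples) {
            acc += e * e;
        }
    }
    for (size_t k = 0; k < SIM_TAPS; k++) {
        w_out[k] = w[k];
    }
    return acc / (double)tail;
}
```

In this toy model the ideal controller is W(z) = 1.6 z⁻¹, because S(z)·W(z) must equal P(z); after a few thousand samples with mu = 0.1 the second weight should sit close to 1.6 and the residual close to zero. Replacing xf with x in the update line is a quick way to see what the secondary-path compensation buys.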

FreeRTOS Real-Time Task Framework

In production products, ANC processing runs as a real-time task triggered by a timer or DMA interrupt.

c
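/* Sample buffers live at file scope so the microphone DMA can fill them
   before the task is notified (DMA start-up is not shown here). */
static int16_t raw_ref[BLOCK_SIZE];
static int16_t raw_err[BLOCK_SIZE];
static uint16_t dac_output[BLOCK_SIZE];

void ANC_Processing_Task(void *argument) {
    /* static: keeps roughly 2 KB of filter state off the task stack */
    static FXLMS_CMSIS fxlms;
    static float32_t s_hat[NUM_TAPS_STAGE] = { 0 /* replace with offline identification results */ };
    static float32_t f_ref[BLOCK_SIZE], f_err[BLOCK_SIZE], output[BLOCK_SIZE];
    static int16_t q15_out[BLOCK_SIZE];

    (void)argument;
    fxlms_cmsis_init(&fxlms, s_hat, NUM_TAPS_STAGE);

    for (;;) {
        /* Block until the DMA interrupt signals that a new block is ready */
        ulTaskNotifyTake(pdTRUE, portMAX_DELAY);
        arm_q15_to_float(raw_ref, f_ref, BLOCK_SIZE);
        arm_q15_to_float(raw_err, f_err, BLOCK_SIZE);
        fxlms_cmsis_process(&fxlms, f_ref, f_err, output, BLOCK_SIZE);
        arm_float_to_q15(output, q15_out, BLOCK_SIZE);
        /* The DAC expects unsigned codes: flipping the sign bit turns
           signed Q15 into offset binary with mid-scale at 0x8000 */
        for (uint32_t i = 0; i < BLOCK_SIZE; i++) {
            dac_output[i] = (uint16_t)q15_out[i] ^ 0x8000u;
        }
        /* Assumes the previous DAC transfer has already finished */
        HAL_DAC_Start_DMA(&hdac, DAC_CHANNEL_1, (uint32_t*)dac_output,
                          BLOCK_SIZE, DAC_ALIGN_12B_L);
    }
}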
mermaid
flowchart TD
    IRQ["DMA Half-Transfer Complete IRQ"] --> NOTIFY["vTaskNotifyGiveFromISR<br/>Wake ANC Task"]
    NOTIFY --> PROC["Process First Half<br/>64 samples"]
    PROC --> DAC["DMA Start DAC Output"]
    IRQ2["DMA Full-Transfer Complete IRQ"] --> NOTIFY2["Wake ANC Task"]
    NOTIFY2 --> PROC2["Process Second Half<br/>64 samples"]
    PROC2 --> DAC2["DMA Start DAC Output"]
    DAC2 --> IRQ

    classDef irq fill:#f44336,color:#fff
    classDef task fill:#2196F3,color:#fff
    classDef out fill:#4CAF50,color:#fff
    class IRQ,IRQ2 irq
    class NOTIFY,PROC,NOTIFY2,PROC2 task
    class DAC,DAC2 out

The DMA half-transfer and full-transfer complete interrupts fire alternately, driving the ANC task to process data blocks in a pipeline fashion.

Task flow:

  1. Wait for notification (ulTaskNotifyTake): the DMA interrupt handler wakes the task, for example with vTaskNotifyGiveFromISR, once a block of samples has been captured
  2. Type conversion: Q15 to float with arm_q15_to_float, one of the batch conversion functions CMSIS-DSP provides
  3. Execute FXLMS: call fxlms_cmsis_process
  4. Convert the result back to Q15 (arm_float_to_q15) and start the DAC DMA transfer
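
The conversions in steps 2 and 4 can be modeled on a host in a few lines, which is handy for unit-testing the scaling before it reaches hardware. This is a portable sketch, not the CMSIS-DSP implementation; on target, the library functions remain the better choice. The function names are hypothetical, and the last helper assumes an unsigned 12-bit DAC register written in left-aligned format.

```c
#include <stdint.h>

/* Float in [-1, 1) to Q15, saturating out-of-range input (truncates, no rounding). */
int16_t float_to_q15_sat(float x)
{
    float scaled = x * 32768.0f;
    if (scaled >= 32767.0f) {
        return INT16_MAX;
    }
    if (scaled <= -32768.0f) {
        return INT16_MIN;
    }
    return (int16_t)scaled;
}

/* Q15 back to float in [-1, 1). */
float q15_to_float(int16_t q)
{
    return (float)q / 32768.0f;
}

/* Signed Q15 to an unsigned, left-aligned DAC code: flipping the sign bit
 * maps -32768..32767 onto 0..65535, with silence at mid-scale (0x8000). */
uint16_t q15_to_dac_left12(int16_t q)
{
    return (uint16_t)((uint16_t)q ^ 0x8000u);
}
```

A left-aligned 12-bit DAC register uses only the upper 12 of the 16 bits, so the four least significant bits of the Q15 sample are simply dropped.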

Real-Time Optimization

ANC is extremely latency-sensitive. The total delay from error microphone capture to speaker output must be kept within 0.5 ms; otherwise the phase shift grows large enough to cause algorithm failure or even positive feedback oscillation.

| Optimization | Description | Typical Effect |
| --- | --- | --- |
| Lower sample rate | 16 kHz instead of 48 kHz | 3× reduction in computation |
| Smaller block size | 16–32 samples | Latency reduced to 1–2 ms |
| Fewer filter taps | 32–64 taps | Thousands of MACs saved per block |
| Fixed-point arithmetic | Q15/Q31 format | Eliminates floating-point conversion overhead |
| DMA double buffering | Ping-pong buffer | Eliminates data transfer wait time |
| Core affinity | Bind ANC to dedicated core | Avoids task switching jitter |

Latency Calculation

Understanding latency requires building intuition: at 16 kHz sample rate, one sample period = 1/16000 = 62.5 μs. Block size 16 → block latency = 16 × 62.5 μs = 1 ms. The total end-to-end latency from noise entering the microphone to anti-noise leaving the speaker consists of three components:

| Latency Component | Typical Value | Description |
| --- | --- | --- |
| Block latency | 1–4 ms | Time to fill a DMA buffer, depends on sample rate and block size |
| Compute latency | 0.1–0.5 ms | FXLMS filtering + weight update time |
| DAC latency | 0.1–0.3 ms | DAC conversion + amplifier response |
| Total latency | 1.2–4.8 ms | Must be <0.5 ms for effective cancellation → extreme optimization needed |

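The 0.5 ms target is an empirical threshold for active noise cancellation. A delay of τ shifts a component at frequency f by 360° × f × τ, so at 0.5 ms the shift reaches 180° at 1 kHz; above that frequency the anti-noise reinforces the noise instead of cancelling it, and the loop can tip into positive feedback oscillation. Because block latency alone equals block size / sample rate, a 16-sample block at 16 kHz already costs 1 ms. Practical ANC systems therefore shrink the block to a few samples (fewer than 8 at 16 kHz) and combine that with short filters (around 32 taps) and fixed-point arithmetic, bringing total latency toward the 0.3–0.4 ms range.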
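
The arithmetic in this section can be checked with a few helper functions. The sketch below is plain C with hypothetical names; it only restates block latency = block size / sample rate and phase shift = 360° × frequency × delay.

```c
/* Time needed to fill one block, in microseconds. */
double block_latency_us(double sample_rate_hz, unsigned block_size)
{
    return (double)block_size * 1e6 / sample_rate_hz;
}

/* Phase shift, in degrees, that a pure delay imposes on a tone at freq_hz. */
double delay_phase_deg(double freq_hz, double delay_us)
{
    return 360.0 * freq_hz * delay_us / 1e6;
}

/* Largest block, in samples, whose fill time does not exceed budget_us. */
unsigned max_block_for_budget(double sample_rate_hz, double budget_us)
{
    return (unsigned)(budget_us * sample_rate_hz / 1e6);
}
```

At 16 kHz, a 16-sample block takes 1000 μs to fill and a 64-sample block takes 4000 μs. A 500 μs delay corresponds to a 180° shift at 1 kHz, and a block of 8 samples would consume the entire 500 μs budget on its own, before any compute or DAC time.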

Fixed-Point Optimization

Although Cortex-M4/M7 cores have hardware floating-point support, a fixed-point FXLMS can further reduce latency. CMSIS-DSP provides the arm_fir_q15 and arm_fir_q31 function families. With coefficients and signals in Q15 format, a single SIMD instruction such as SMLAD performs two 16-bit multiply-accumulates.

Float vs Fixed-Point Tradeoffs: LMS weight updates involve numerous multiply-accumulate operations. In Q15 format, the product of two 16-bit numbers is a Q30 value held in 32 bits; it must be shifted right by 15 bits and saturated to return to the Q15 range, and accumulated sums need saturation as well, since overflow would otherwise wrap around and cause drastic coefficient jumps. Floating-point numbers (float32) are processed directly by the hardware FPU; IEEE 754’s exponent bits handle the dynamic range automatically, which makes overflow a non-issue at audio signal levels and keeps the code more intuitive.

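| Dimension | float32 | Q15/Q31 |
| --- | --- | --- |
| Dynamic range | ±3.4×10³⁸ | [-1, 1) for both; resolution 2⁻¹⁵ (Q15) / 2⁻³¹ (Q31) |
| Overflow risk | Negligible | Saturation required |
| MACs per instruction | 1 (VFMA) | 2 (SMUAD, Q15) |
| Power consumption | Higher | Lower |
| Code readability | High | Low |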
In summary, floating-point suits prototyping and M7-series parts with ample performance headroom; fixed-point is appropriate for power-constrained or cost-sensitive scenarios, though debugging effort increases significantly. Many products validate on a floating-point M7 prototype, then port to fixed-point M4 for cost optimization.
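
The saturation handling described above looks like this in portable C. It is a minimal sketch with hypothetical helper names, written for clarity rather than speed; on target, CMSIS-DSP and the saturating DSP instructions do the same job. It assumes the compiler implements right shift of negative values as an arithmetic shift, which GCC, Clang and the Arm compilers do.

```c
#include <stdint.h>

/* Clamp a 32-bit intermediate to the Q15 range. */
static int16_t sat_q15(int32_t v)
{
    if (v > INT16_MAX) {
        return INT16_MAX;
    }
    if (v < INT16_MIN) {
        return INT16_MIN;
    }
    return (int16_t)v;
}

/* Q15 x Q15: the 32-bit product is Q30, so shift right by 15 to get Q15.
 * Only (-1.0) x (-1.0) falls outside the range and needs the clamp. */
int16_t q15_mul(int16_t a, int16_t b)
{
    int32_t product = (int32_t)a * (int32_t)b;   /* Q30 */
    return sat_q15(product >> 15);
}

/* One LMS-style coefficient update in Q15: w + mu_e * x, saturated.
 * mu_e stands for the already-scaled product of step size and error. */
int16_t q15_lms_update(int16_t w, int16_t mu_e, int16_t x)
{
    int32_t delta = ((int32_t)mu_e * (int32_t)x) >> 15;
    return sat_q15((int32_t)w + delta);
}
```

Without the clamp, a coefficient near full scale that receives a positive update would wrap to a large negative value, which is exactly the coefficient jump the text warns about.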

DMA Double Buffering

Double buffering (ping-pong mode) is a classic DMA technique. Two buffers alternate roles: while the DMA controller writes new data to buffer A, the CPU processes the already-ready data in buffer B; when DMA completes writing to A, it immediately switches to B, while the CPU starts processing A. This alternating mechanism eliminates data transfer wait time, allowing CPU and DMA to work fully in parallel.

At a 16 kHz sample rate with a block size of 64, DMA fills one buffer in approximately 4 ms. With single buffering, the CPU must wait for DMA to finish before starting processing, wasting half the available time. Double buffering nearly doubles effective computational throughput.
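
The hand-over between DMA and CPU can be modeled as a tiny state machine. The sketch below is hardware-independent C with invented names (PingPong, pingpong_on_half, pingpong_on_full, pingpong_take); it assumes a circular DMA transfer over both halves, with the two functions called from the half-transfer and transfer-complete interrupts. The read-and-clear in pingpong_take is not atomic, which is a simplification; a real system would use a task notification or a critical section.

```c
#include <stddef.h>
#include <stdint.h>

#define PP_BLOCK 64   /* samples per half, matching the block size used above */

typedef struct {
    int16_t buf[2][PP_BLOCK];   /* buf[0] = A, buf[1] = B; DMA fills them in turn */
    volatile int ready;         /* index of the half the CPU may read, -1 if none */
} PingPong;

void pingpong_init(PingPong *pp)
{
    pp->ready = -1;
}

/* Half-transfer interrupt: A is complete, DMA moves on to B. */
void pingpong_on_half(PingPong *pp)
{
    pp->ready = 0;
}

/* Transfer-complete interrupt: B is complete, DMA wraps around to A. */
void pingpong_on_full(PingPong *pp)
{
    pp->ready = 1;
}

/* Processing task: returns the finished half, or NULL if nothing is pending. */
const int16_t *pingpong_take(PingPong *pp)
{
    int idx = pp->ready;
    if (idx < 0) {
        return NULL;
    }
    pp->ready = -1;
    return pp->buf[idx];
}
```

The CPU has one block period (4 ms in the example above) to finish with the half it was handed; if processing overruns, the DMA starts overwriting data that is still being read.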

c
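// DMA ping-pong: one circular transfer spans both halves (A, then B) of the buffer
HAL_DFSDM_FilterRegConvStart_DMA(&hdfsdm1_Filter0,
                                 &ping_pong_buf[0][0],
                                 2 * BLOCK_SIZE);
// Half-transfer IRQ: CPU processes ping_pong_buf[0] while DMA fills ping_pong_buf[1]
// Full-transfer IRQ: CPU processes ping_pong_buf[1] while DMA wraps to ping_pong_buf[0]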

The table below shows the ping-pong alternation, with DMA writes and CPU processing running in parallel and the two buffers constantly swapping roles:

| Phase | Buffer A | Buffer B |
| --- | --- | --- |
| 1 | DMA write | CPU processing |
| 2 | CPU processing | DMA write |

Ping-pong buffer: DMA write ↔ CPU read

Heterogeneous Architecture (ESP32 + STM32)

In complex ANC systems, a single MCU struggles to simultaneously meet the demands of real-time audio processing and non-real-time system management. A common approach is a heterogeneous architecture: ESP32 handles wireless connectivity, user interaction, and system management, while STM32 focuses on real-time audio processing.

The two chips communicate over UART, SPI, or I2C. The data passed includes audio streams and control commands.

Custom Communication Frame Format

c
typedef struct {
    uint8_t  header[2];     // 0xAA 0x55
    uint8_t  type;          // 0x01=audio data, 0x02=control command
    uint16_t seq;
    uint16_t len;
    int16_t  samples[];
} __attribute__((packed)) AudioFrame;

The header field provides frame synchronization: the receiver starts parsing upon detecting 0xAA 0x55. type distinguishes between audio sample streams and parameter configuration commands. seq enables packet loss detection and reordering, and len gives the size of the payload that follows.

The ESP32 handles the Wi-Fi/Bluetooth protocol stack, user interface, and parameter updates. When a user adjusts the ANC mode through a mobile app, the ESP32 packages the new filter coefficients or step-size parameters into a control frame and sends it to the STM32. The STM32 hot-updates the algorithm parameters on receipt without pausing audio processing.
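
A receiver for this frame format might look like the sketch below. It is a minimal illustration, not the project's actual protocol code: the function names are hypothetical, and it assumes that multi-byte fields travel little-endian, which is what a packed struct produces on both the STM32 and the ESP32. The fields are read byte by byte so the code does not depend on struct packing or alignment.

```c
#include <stddef.h>
#include <stdint.h>

#define FRAME_HEADER_LEN 7   /* header[2] + type + seq (2 bytes) + len (2 bytes) */

/* Returns the offset of the first 0xAA 0x55 pair, or -1 if there is none. */
int frame_find_sync(const uint8_t *buf, size_t n)
{
    for (size_t i = 0; i + 1 < n; i++) {
        if (buf[i] == 0xAA && buf[i + 1] == 0x55) {
            return (int)i;
        }
    }
    return -1;
}

/* Writes the 7-byte header (little-endian seq and len); returns bytes written. */
size_t frame_write_header(uint8_t *out, uint8_t type, uint16_t seq, uint16_t len)
{
    out[0] = 0xAA;
    out[1] = 0x55;
    out[2] = type;
    out[3] = (uint8_t)(seq & 0xFF);
    out[4] = (uint8_t)(seq >> 8);
    out[5] = (uint8_t)(len & 0xFF);
    out[6] = (uint8_t)(len >> 8);
    return FRAME_HEADER_LEN;
}

/* Parses a header at buf; returns 0 on success, -1 if too short or out of sync. */
int frame_read_header(const uint8_t *buf, size_t n,
                      uint8_t *type, uint16_t *seq, uint16_t *len)
{
    if (n < FRAME_HEADER_LEN || buf[0] != 0xAA || buf[1] != 0x55) {
        return -1;
    }
    *type = buf[2];
    *seq = (uint16_t)(buf[3] | (buf[4] << 8));
    *len = (uint16_t)(buf[5] | (buf[6] << 8));
    return 0;
}

/* Frames missing between two consecutively received sequence numbers
 * (the counter wraps at 65536). */
uint16_t frame_lost_between(uint16_t prev_seq, uint16_t seq)
{
    return (uint16_t)(seq - prev_seq - 1u);
}
```

Since 0xAA 0x55 can also occur inside audio payloads, a production link would normally add a checksum or CRC so that a false sync is detected and the receiver can resume scanning.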

Mermaid: STM32 ANC Data Flow

mermaid
flowchart TD
    subgraph Input
        REF["Reference Mic<br/>PDM"]
        ERR["Error Mic<br/>PDM"]
    end
    subgraph STM32
        DFSDM["DFSDM<br/>PDM→PCM"]
        DMA["DMA<br/>Ping-Pong"]
        FXLMS["FXLMS<br/>CMSIS-DSP"]
        DAC["DAC<br/>PCM→Analog"]
        TASK["FreeRTOS<br/>ANC Task"]
    end
    subgraph Output
        SPK["Speaker<br/>Anti-noise"]
    end
    subgraph ESP
        WIFI["Wi-Fi/BT<br/>Parameter Push"]
    end

    REF --> DFSDM
    ERR --> DFSDM
    DFSDM --> DMA
    DMA --> TASK
    TASK --> FXLMS
    FXLMS --> DAC
    DAC --> SPK
    WIFI -->|UART/SPI| TASK