Embedded ANC: ESP32 Practice

The ESP32-S3’s dual-core Xtensa LX7 processor with vector instruction set extensions is well-suited for the real-time DSP workloads required by embedded ANC. Combined with the ESP-DSP library, efficient adaptive filtering becomes practical on this platform.

I2S Microphone Capture

ANC requires at least two synchronous input channels: a reference microphone (capturing ambient noise) and an error microphone (capturing the residual after cancellation). One I2S port on the ESP32-S3 can receive both in a single stereo stream, with one microphone in each slot, which keeps the two channels sample-synchronous. 16-bit samples at 16 kHz are sufficient for consumer-grade ANC.

c
#include "driver/i2s.h"

#define I2S_SAMPLE_RATE     16000
#define I2S_BUFFER_SIZE     1024

void i2s_mic_init(void) {
    i2s_config_t i2s_config = {
        .mode = I2S_MODE_MASTER | I2S_MODE_RX,
        .sample_rate = I2S_SAMPLE_RATE,
        .bits_per_sample = I2S_BITS_PER_SAMPLE_16BIT,
        .channel_format = I2S_CHANNEL_FMT_RIGHT_LEFT,   // both slots: reference and error mics
        .communication_format = I2S_COMM_FORMAT_STAND_I2S,
        .intr_alloc_flags = ESP_INTR_FLAG_LEVEL1,
        .dma_buf_count = 4,
        .dma_buf_len = I2S_BUFFER_SIZE,
        .use_apll = true,
        .tx_desc_auto_clear = false,
        .fixed_mclk = 0
    };

    i2s_pin_config_t pin_config = {
        .bck_io_num = GPIO_NUM_4,
        .ws_io_num = GPIO_NUM_5,
        .data_out_num = I2S_PIN_NO_CHANGE,
        .data_in_num = GPIO_NUM_6
    };

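    // Abort early if the driver or pin setup fails.
    ESP_ERROR_CHECK(i2s_driver_install(I2S_NUM_0, &i2s_config, 0, NULL));
    ESP_ERROR_CHECK(i2s_set_pin(I2S_NUM_0, &pin_config));
}

// Blocking read of len 16-bit samples; returns the number actually read.
size_t read_mic_samples(int16_t *buffer, size_t len) {
    size_t bytes_read = 0;
    i2s_read(I2S_NUM_0, buffer, len * sizeof(int16_t), &bytes_read, portMAX_DELAY);
    return bytes_read / sizeof(int16_t);
}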

Setting use_apll = true selects the Audio PLL (APLL) as the I2S clock source where the chip provides one. The APLL is a dedicated PLL with significantly less jitter than the default I2S clock; clock jitter translates directly into sampling timing errors, which in a phase-sensitive system like ANC reduce the achievable cancellation depth. APLL availability is chip-dependent, so check the datasheet for the exact target. The dma_buf_count = 4 and dma_buf_len = 1024 settings mean 4 rotating buffers of 1024 frames each, for a total of 4096 frames, which at 16 kHz is 256 ms. A larger buffer tolerates occasional CPU scheduling delays but increases I2S capture latency, a trade-off between real-time responsiveness and robustness. Because data only becomes available once a DMA buffer has filled, a 1024-frame buffer adds up to 64 ms of capture delay; the 128-sample frames used by the ANC loop later in this section call for a correspondingly smaller dma_buf_len.
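The buffer arithmetic above can be checked with a few lines of host-side C. This is a minimal sketch, not driver code: the helper names dma_ring_ms and dma_fill_ms are hypothetical, and the fill-time figure assumes that captured data becomes readable only once a whole DMA buffer has filled.

```c
#include <assert.h>

/* Total audio time held by the DMA ring, in milliseconds.
 * buf_count and buf_len mirror dma_buf_count and dma_buf_len. */
double dma_ring_ms(int buf_count, int buf_len, int sample_rate_hz) {
    assert(buf_count > 0 && buf_len > 0 && sample_rate_hz > 0);
    return 1000.0 * buf_count * buf_len / sample_rate_hz;
}

/* Time needed to fill one DMA buffer, in milliseconds.
 * Assumption: this is the minimum delay before a read can return data. */
double dma_fill_ms(int buf_len, int sample_rate_hz) {
    assert(buf_len > 0 && sample_rate_hz > 0);
    return 1000.0 * buf_len / sample_rate_hz;
}
```

With the settings from the listing, dma_ring_ms(4, 1024, 16000) gives 256 ms and dma_fill_ms(1024, 16000) gives 64 ms, while a 128-frame buffer fills in 8 ms.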

The diagram below shows the complete ANC data flow from microphones through DMA into the NLMS processor, and finally to the speaker output:

mermaid
flowchart TD
    MIC1["Reference Mic<br/>I2S CH0"] --> DMA["DMA Double Buffer<br/>128 samples/frame"]
    MIC2["Error Mic<br/>I2S CH1"] --> DMA
    DMA --> DSP["NLMS Processing<br/>Normalized Step"]
    DSP --> DAC["I2S Output<br/>16bit PCM"]
    DAC --> SPK["Speaker"]

    classDef input fill:#2196F3,color:#fff
    classDef proc fill:#9C27B0,color:#fff
    classDef output fill:#4CAF50,color:#fff
    class MIC1,MIC2,DMA input
    class DSP proc
    class DAC,SPK output

Adaptive Filter Core

The NLMS filter is the heart of the entire ANC system. On the ESP32-S3, a filter order of 64 (4 ms of impulse response at 16 kHz) covers the dominant low-frequency noise below 1 kHz. The circular buffer avoids shifting all data on every sample, reducing the per-sample buffer update from O(N) to O(1); the convolution itself remains O(N).

c
#include <string.h>   // memset
#include "esp_dsp.h"
#include "driver/i2s.h"
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"

#define SAMPLE_RATE     16000
#define FRAME_SIZE      128
#define FILTER_ORDER    64

typedef struct {
    float w[FILTER_ORDER];          // filter coefficients
    float x_ring[FILTER_ORDER];     // reference signal ring buffer
    float s_hat[32];                // secondary path estimate
    float s_ring[32];               // filtered signal ring buffer
    int w_ptr;                      // ring buffer write pointer
    int s_ptr;
    float mu;                       // adaptation step size
} ESP32_ANC;

static ESP32_ANC g_anc;

void esp_anc_init(void) {
    memset(&g_anc, 0, sizeof(g_anc));
    g_anc.mu = 0.0005f;
    dsps_fft2r_init_fc32(NULL, CONFIG_DSP_MAX_FFT_SIZE);
}

The secondary path estimate s_hat has 32 taps. It is typically obtained through offline identification: before deployment, white noise is played through the speaker and recorded with the error microphone, and the transfer function from speaker to error mic is then estimated via LMS.
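The identification step can be sketched on a host machine before touching hardware. The example below is an illustration under stated assumptions rather than the article's procedure: S_TAPS is shortened to 8 for readability, s_true is a made-up stand-in for the real speaker-to-microphone path, the excitation comes from a simple deterministic generator instead of a real recording, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define S_TAPS 8   /* hypothetical length; the article uses 32 taps */

/* Deterministic noise-like sequence in [-1, 1) from a 32-bit LCG. */
static float noise_next(uint32_t *state) {
    *state = *state * 1664525u + 1013904223u;
    return (float)(*state >> 8) / 8388608.0f - 1.0f;
}

/* Estimate an FIR path with plain LMS.
 * s_true simulates the acoustic path; on hardware, d would instead be
 * the error-microphone recording of the played noise. */
void identify_secondary_path(const float *s_true, float *s_hat,
                             int n_samples, float mu) {
    float x[S_TAPS] = {0};   /* excitation history, newest sample first */
    uint32_t rng = 12345u;
    assert(n_samples > 0 && mu > 0.0f);
    memset(s_hat, 0, S_TAPS * sizeof(float));

    for (int n = 0; n < n_samples; n++) {
        /* Shift the history by one sample (clarity over speed here). */
        memmove(&x[1], &x[0], (S_TAPS - 1) * sizeof(float));
        x[0] = noise_next(&rng);

        float d = 0.0f;   /* simulated microphone signal */
        float y = 0.0f;   /* output of the current estimate */
        for (int i = 0; i < S_TAPS; i++) {
            d += s_true[i] * x[i];
            y += s_hat[i] * x[i];
        }

        float e = d - y;  /* identification error */
        for (int i = 0; i < S_TAPS; i++)
            s_hat[i] += mu * e * x[i];   /* LMS coefficient update */
    }
}
```

In this noise-free simulation a step size of 0.05 and a few thousand samples are enough for s_hat to settle on s_true; a real recording includes measurement noise, so the estimate would only approximate the true path.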

NLMS Frame Processing

The real-time processing flow for each audio frame (128 samples):

c
static inline float fast_convolve(const float *w, const float *x, int ptr, int len) {
    float output = 0.0f;
    for (int i = 0; i < len; i++) {
        int idx = (ptr + len - i) % len;
        output += w[i] * x[idx];
    }
    return output;
}

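// Process one frame: compute the anti-noise output and adapt the weights
// once per sample.
void esp_anc_process_frame(float *ref, float *err, float *out, int len) {
    for (int n = 0; n < len; n++) {
        // Push the newest reference sample into both ring buffers.
        g_anc.x_ring[g_anc.w_ptr] = ref[n];
        g_anc.s_ring[g_anc.s_ptr] = ref[n];

        // Anti-noise output: adaptive filter applied to the reference.
        float y = fast_convolve(g_anc.w, g_anc.x_ring, g_anc.w_ptr, FILTER_ORDER);

        // Reference filtered through the secondary-path estimate.
        // NOTE: x_filt is computed but not used below, so the update is
        // plain NLMS. A full FxLMS update would keep a history of x_filt
        // and use it in place of x_ring in the weight update.
        float x_filt = fast_convolve(g_anc.s_hat, g_anc.s_ring, g_anc.s_ptr, 32);
        (void)x_filt;

        // Reference power over the filter window (sum order is irrelevant).
        float power = 0.0f;
        for (int i = 0; i < FILTER_ORDER; i++)
            power += g_anc.x_ring[i] * g_anc.x_ring[i];
        float norm_step = g_anc.mu / (power + 1e-6f);

        // NLMS weight update: w[i] pairs with the sample i steps back.
        for (int i = 0; i < FILTER_ORDER; i++) {
            int idx = (g_anc.w_ptr + FILTER_ORDER - i) % FILTER_ORDER;
            g_anc.w[i] += norm_step * err[n] * g_anc.x_ring[idx];
        }

        out[n] = y;
        // Advance both write pointers, wrapping at the buffer length.
        g_anc.w_ptr = (g_anc.w_ptr + 1) % FILTER_ORDER;
        g_anc.s_ptr = (g_anc.s_ptr + 1) % 32;
    }
}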

[Interactive animation, not reproduced here: ring buffer convolution for a 5-tap FIR filter with w = [0.40, 0.30, 0.20, 0.07, 0.03]. The read pointer steps backward from the newest sample and wraps around the end of the buffer.]
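The same 5-tap case can be traced in plain C. The sketch below reuses the indexing of fast_convolve and compares it with a direct-form sum; the function names ring_convolve, run_ring_fir and direct_fir_last are hypothetical and exist only for this illustration.

```c
#include <assert.h>

#define TAPS 5

/* Same indexing as fast_convolve: ptr is the slot holding the newest sample. */
float ring_convolve(const float *w, const float *x, int ptr, int len) {
    float acc = 0.0f;
    for (int i = 0; i < len; i++) {
        int idx = (ptr + len - i) % len;   /* i samples back, with wrap-around */
        acc += w[i] * x[idx];
    }
    return acc;
}

/* Push n samples through a ring buffer; return the output for the last one. */
float run_ring_fir(const float *w, const float *in, int n) {
    float ring[TAPS] = {0};
    int ptr = 0;
    float y = 0.0f;
    assert(n > 0);
    for (int k = 0; k < n; k++) {
        ring[ptr] = in[k];                 /* overwrite the oldest slot */
        y = ring_convolve(w, ring, ptr, TAPS);
        ptr = (ptr + 1) % TAPS;
    }
    return y;
}

/* Reference: direct-form sum of w[i] * in[n-1-i] for the last sample. */
float direct_fir_last(const float *w, const float *in, int n) {
    float acc = 0.0f;
    for (int i = 0; i < TAPS && i < n; i++)
        acc += w[i] * in[n - 1 - i];
    return acc;
}
```

Feeding the inputs 1 through 7 with the weights above, the seventh sample lands in slot 1, and the reads proceed through slots 1, 0, 4, 3, 2. Both functions return 0.4·7 + 0.3·6 + 0.2·5 + 0.07·4 + 0.03·3 = 5.97.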

fast_convolve computes the convolution using ring buffer indexing, avoiding data reordering on each sample. The NLMS update divides the step size by the reference signal power, and the regularization term 1e-6f prevents division by zero when that power is very low. The step size mu = 0.0005 balances convergence speed against steady-state misadjustment: fast enough for steady noises like fans and engines without excessive fluctuation after convergence.
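The listing recomputes the reference power over all 64 taps for every sample. One possible refinement, which is not part of the original code, is to track the power incrementally by adding the square of the incoming sample and subtracting the square of the one it overwrites. The PowerTracker type and its functions below are hypothetical names for this sketch.

```c
#include <assert.h>

#define ORDER 64

typedef struct {
    float ring[ORDER];   /* reference history */
    int   ptr;           /* next slot to overwrite (the oldest sample) */
    float power;         /* running sum of squares over ring[] */
} PowerTracker;

/* O(1) update: add the new sample's energy, remove the evicted one's. */
void power_push(PowerTracker *t, float x) {
    float old = t->ring[t->ptr];
    t->power += x * x - old * old;
    if (t->power < 0.0f)
        t->power = 0.0f;             /* guard against rounding drift */
    t->ring[t->ptr] = x;
    t->ptr = (t->ptr + 1) % ORDER;
}

/* O(N) reference computation, equivalent to the loop in the listing. */
float power_full(const PowerTracker *t) {
    float p = 0.0f;
    for (int i = 0; i < ORDER; i++)
        p += t->ring[i] * t->ring[i];
    return p;
}

/* Normalized step with the same 1e-6f regularization term. */
float nlms_step(float mu, float power) {
    assert(power >= 0.0f);
    return mu / (power + 1e-6f);
}
```

With a zero-initialized tracker, the running value stays close to the full recomputation. Floating-point error accumulates slowly over long runs, so an occasional full recompute is a reasonable safeguard. For a silent reference, nlms_step(0.0005f, 0.0f) evaluates to about 500 rather than dividing by zero.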

Dual-Core Task Assignment

The ANC loop is pinned to CPU1 (APP_CPU), while CPU0 (PRO_CPU) handles the Wi-Fi/Bluetooth stacks and other system tasks. With a 16 kHz sample rate and a frame size of 128 samples, each frame has a processing budget of 128 / 16000 = 8 ms.

The diagram below illustrates the specific division of labor between the two cores—CPU0 handles system management while CPU1 is dedicated to the real-time ANC pipeline:

mermaid
flowchart TD
    subgraph CPU0["CPU0 - System Management"]
        A1["WiFi/BT Stack"]
        A2["Config Management"]
        A3["OTA Updates"]
    end
    subgraph CPU1["CPU1 - Real-time ANC"]
        B1["I2S Read<br/>Dual-channel Audio"]
        B2["NLMS/FXLMS<br/>Filter Computation"]
        B3["I2S Write<br/>Anti-noise Output"]
        B1 --> B2 --> B3 --> B1
    end

    classDef sys fill:#2196F3,color:#fff
    classDef anc fill:#f44336,color:#fff
    class A1,A2,A3 sys
    class B1,B2,B3 anc

c
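#include <math.h>             // fmaxf, fminf
#include "esp_heap_caps.h"    // heap_caps_malloc, MALLOC_CAP_INTERNAL

static void anc_task(void *arg) {
    // Interleaved stereo capture; this code assumes [ref0, err0, ref1, err1, ...].
    int16_t *raw_buffer = heap_caps_malloc(FRAME_SIZE * 2 * sizeof(int16_t),
                                           MALLOC_CAP_INTERNAL);
    float *f_ref = heap_caps_malloc(FRAME_SIZE * sizeof(float), MALLOC_CAP_INTERNAL);
    float *f_err = heap_caps_malloc(FRAME_SIZE * sizeof(float), MALLOC_CAP_INTERNAL);
    float *f_out = heap_caps_malloc(FRAME_SIZE * sizeof(float), MALLOC_CAP_INTERNAL);
    int16_t *dac_buffer = heap_caps_malloc(FRAME_SIZE * sizeof(int16_t), MALLOC_CAP_INTERNAL);

    // Stop the task if internal SRAM is exhausted.
    if (!raw_buffer || !f_ref || !f_err || !f_out || !dac_buffer) {
        vTaskDelete(NULL);
        return;
    }

    size_t bytes_read;
    size_t bytes_written;
    for (;;) {
        // Block until one stereo frame has been captured.
        i2s_read(I2S_NUM_0, raw_buffer, FRAME_SIZE * 2 * sizeof(int16_t),
                 &bytes_read, portMAX_DELAY);

        // De-interleave and scale 16-bit PCM to [-1, 1).
        for (int i = 0; i < FRAME_SIZE; i++) {
            f_ref[i] = raw_buffer[i * 2] / 32768.0f;
            f_err[i] = raw_buffer[i * 2 + 1] / 32768.0f;
        }

        esp_anc_process_frame(f_ref, f_err, f_out, FRAME_SIZE);

        // Clamp to [-1, 1] and convert back to 16-bit PCM.
        for (int i = 0; i < FRAME_SIZE; i++) {
            float clamped = fmaxf(-1.0f, fminf(1.0f, f_out[i]));
            dac_buffer[i] = (int16_t)(clamped * 32767.0f);
        }

        i2s_write(I2S_NUM_1, dac_buffer, FRAME_SIZE * sizeof(int16_t),
                  &bytes_written, portMAX_DELAY);
    }
}

void app_main(void) {
    esp_anc_init();
    i2s_mic_init();
    // NOTE: I2S_NUM_1 must also be installed in TX mode before the task
    // starts; that output setup is not shown in this section.
    xTaskCreatePinnedToCore(anc_task, "anc_task", 8192, NULL,
                            configMAX_PRIORITIES - 1, NULL, 1);
}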

Using heap_caps_malloc(..., MALLOC_CAP_INTERNAL) ensures all buffers reside in internal SRAM, avoiding PSRAM access latency (typically 3-5× slower than SRAM) that could cause processing overruns. A task stack size of 8192 bytes is sufficient for the DSP call chain’s local variables.

The I2S output (I2S_NUM_1) drives an external DAC or Class-D amplifier, emitting anti-phase sound waves that cancel the original noise acoustically. Output values are clamped to [-1, 1] via fmaxf/fminf before conversion so that out-of-range samples saturate instead of overflowing the 16-bit integer range.
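The two conversions in the task loop can be isolated into small helpers and checked on a host machine. This is a minimal sketch with hypothetical names (pcm16_to_float, float_to_pcm16); it uses explicit comparisons instead of fmaxf/fminf so that it has no math-library dependency, and it treats NaN input as a programming error.

```c
#include <assert.h>
#include <stdint.h>

/* 16-bit PCM to float in [-1, 1), matching the division by 32768 above. */
float pcm16_to_float(int16_t s) {
    return s / 32768.0f;
}

/* Float to 16-bit PCM with saturation; values outside [-1, 1] are
 * clamped first so the cast never sees an out-of-range value. */
int16_t float_to_pcm16(float x) {
    assert(x == x);                 /* reject NaN */
    if (x > 1.0f)  x = 1.0f;
    if (x < -1.0f) x = -1.0f;
    return (int16_t)(x * 32767.0f); /* truncates toward zero */
}
```

An input of 2.0 saturates to 32767 and an input of -2.0 to -32767; without the clamp, the cast would be applied to a value outside the int16_t range.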

The reason for pinning to CPU1 rather than CPU0: by default, ESP-IDF runs the Wi-Fi and Bluetooth stacks on CPU0. Wi-Fi hardware interrupts are high-priority and frequent; every beacon frame reception or TCP ACK triggers one. These interrupts preempt user tasks on CPU0, causing unpredictable delay jitter in ANC frame processing. CPU1 (APP_CPU) is largely unaffected by Wi-Fi interrupts, making it the better choice for real-time DSP tasks.

ESP32-S3 vs ESP32: The S3’s Xtensa LX7 processor adds SIMD instruction extensions (the EE.* family) that can complete multiple multiply-accumulate (MAC) operations in a single cycle, making NLMS convolution updates approximately 3-5× faster than on the ESP32 (LX6) when the optimized ESP-DSP routines are used. Both cores on the ESP32 are LX6, which lacks these extensions and has significantly lower DSP efficiency. If using an ESP32 (non-S3), consider reducing the filter order to 32 to maintain real-time performance. Shortening the frame to 64 samples reduces latency, but the per-sample workload, and therefore the CPU load, stays the same.
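A rough operation count shows what a smaller filter order buys. The sketch below counts multiply-accumulates for the per-sample loop shown earlier: one pass each for the output convolution, the power sum and the weight update, plus the secondary-path convolution. The function names are hypothetical, and the capacity argument of cpu_load is an assumed sustained MAC rate that would have to be measured on the actual target.

```c
#include <assert.h>

/* Approximate MACs per sample: three passes over filter_order taps
 * plus one pass over the secondary-path taps. */
long macs_per_sample(int filter_order, int sec_taps) {
    assert(filter_order > 0 && sec_taps >= 0);
    return 3L * filter_order + sec_taps;
}

/* MACs per second at a given sample rate. */
long macs_per_second(int filter_order, int sec_taps, int sample_rate_hz) {
    assert(sample_rate_hz > 0);
    return macs_per_sample(filter_order, sec_taps) * sample_rate_hz;
}

/* Fraction of one core used, for an assumed sustained MAC rate.
 * Frame size does not appear: it scales work and budget equally. */
double cpu_load(int filter_order, int sec_taps, int sample_rate_hz,
                double capacity_macs_per_sec) {
    assert(capacity_macs_per_sec > 0.0);
    return macs_per_second(filter_order, sec_taps, sample_rate_hz)
           / capacity_macs_per_sec;
}
```

At 16 kHz with a 32-tap secondary path, order 64 costs 224 MACs per sample (about 3.6 million per second) and order 32 costs 128 MACs per sample (about 2.0 million per second).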

Build Configuration

The following options must be enabled in sdkconfig:

For ESP-IDF v5.0 and above, the driver/i2s.h API used here is deprecated; migrating to the new I2S driver (driver/i2s_std.h) is recommended, though the processing logic remains the same.

Latency Budget

The total pipeline latency accumulates as follows:

| Stage | Latency |
| --- | --- |
| I2S DMA capture | 1 frame (128 samples) ≈ 8 ms |
| ADC conversion + transfer | ~0.5 ms |
| Frame processing (NLMS) | ~2-3 ms |
| DAC conversion + transfer | ~0.5 ms |
| Acoustic path | ~0.1-0.3 ms |
| Total | ~11-12 ms |

The following diagram breaks down the latency budget visually:

mermaid
flowchart TD
    A["I2S DMA Capture<br/>~8ms"] --> B["ADC Conversion + Transfer<br/>~0.5ms"]
    B --> C["NLMS Compute<br/>~2-3ms"]
    C --> D["DAC Conversion + Transfer<br/>~0.5ms"]
    D --> E["Acoustic Path<br/>~0.1-0.3ms"]
    E --> TOTAL["Total Latency<br/>~11-12ms<br/>Effective Freq: ~40-45Hz"]

    classDef stage fill:#2196F3,color:#fff
    classDef result fill:#f44336,color:#fff
    class A,B,C,D,E stage
    class TOTAL result

By the rule of thumb used here, the effective frequency ceiling is approximately 1 / (2 × total latency) ≈ 40-45 Hz, so this approach primarily targets low-frequency rumble below 50 Hz, such as air conditioner compressors, fans, and road noise. To raise the ceiling, reduce the total latency: use a smaller frame size, a lower filter order, or a more powerful MCU.
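The budget and the resulting ceiling can be reproduced with a short calculation. The sketch below sums per-stage ranges and applies the 1 / (2 × latency) rule of thumb from the text; the LatencyRange type and the function names are hypothetical, and the stage values are the ones listed in the latency table.

```c
#include <assert.h>

typedef struct {
    double min_ms;   /* best-case stage latency */
    double max_ms;   /* worst-case stage latency */
} LatencyRange;

/* Sum best-case and worst-case latencies over all pipeline stages. */
LatencyRange total_latency(const LatencyRange *stages, int n) {
    LatencyRange total = {0.0, 0.0};
    assert(n > 0);
    for (int i = 0; i < n; i++) {
        total.min_ms += stages[i].min_ms;
        total.max_ms += stages[i].max_ms;
    }
    return total;
}

/* Rule of thumb from the text: f_max = 1 / (2 * latency). */
double ceiling_hz(double latency_ms) {
    assert(latency_ms > 0.0);
    return 1000.0 / (2.0 * latency_ms);
}
```

With the five stages from the table (8, 0.5, 2-3, 0.5 and 0.1-0.3 ms), the total comes to 11.1-12.3 ms and the ceiling to roughly 41-45 Hz. Halving the frame to 64 samples would cut the capture stage to 4 ms and raise the ceiling accordingly.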