Building a Surveillance Camera with ESP32-S3 — WiFi, TF Card, Video Output Pitfalls
Why
We have a few parrots at home, and during the workday nobody’s around. I wanted to check in on them anytime. The requirement sounds simple: real-time video streaming, recording to storage, and ideally automatic backup to NAS. Off-the-shelf cameras are either expensive or require installing apps, registering accounts, and binding phone numbers — privacy concerns. I just want to watch my birds, not stream video to someone else’s server.
I happened to have a Seeed XIAO ESP32-S3 Sense dev board, with a built-in OV2640 camera and TF card slot. The ESP32-S3 has WiFi, dual-core 240MHz, 8MB external PSRAM — theoretically more than enough for a camera application. So I decided to build one myself: MiBeeHomeCam.
Since I planned to open-source it, I made the features as comprehensive as possible. The goal was clear: write a surveillance camera firmware from scratch using ESP-IDF, with MJPEG real-time streaming viewable in a browser, AVI segmented recording to TF card, FTP/WebDAV auto-upload to NAS, a Web management interface, and Prometheus metrics for monitoring integration. I also plan to build an NVR to collect video files for visual analysis later — but that’s a story for another post. This one only covers the camera itself.
Here’s what I ended up implementing:
- View real-time video by opening the device IP in a browser
- Auto-segmented recording (default 5-minute segments), loop storage to prevent full cards
- FTP/WebDAV auto-upload to NAS
- Web interface for config management, browsing/downloading/deleting recordings
- Prometheus /metrics endpoint for monitoring integration
- WiFi AP/STA dual mode, phone-based network config on first boot
Sounds comprehensive, but the pitfalls were numerous.
Hardware Selection
I used the Seeed XIAO ESP32-S3 Sense board — tiny (thumb-sized) but well-equipped:
- ESP32-S3 dual-core 240MHz, 8MB Octal PSRAM
- Onboard OV2640 camera (also compatible with OV3660)
- TF card slot (SPI mode, not 4-bit SDMMC)
- WiFi 2.4GHz (note: no 5GHz support)
- USB-C power
Honestly, I chose this board mainly for its small size and built-in camera + TF card slot — no extra wiring needed. But as I hit pitfalls later, I discovered the board has its own “special” issues.
WiFi: Can’t Even Connect, Let Alone Monitor
WiFi was the first thing that drove me crazy. I wrote the firmware, flashed it, and eagerly expected to see video — but the device simply couldn’t connect to the router.
Auth timeout: auth expired / assoc expired
The serial console kept showing two state transitions: auth -> init (0x200), which means authentication timed out, and assoc -> init (0x400), which means association timed out. The device could see the router’s AP in a scan, but the authentication frames just wouldn’t get through.
After much troubleshooting — swapping routers, changing channels, changing passwords — nothing worked. Eventually I discovered that the XIAO ESP32-S3 board’s WiFi transmit power defaults to a low value. This is a known issue with Seeed boards manufactured before August 2025. The power was too low, and authentication frames were being dropped before reaching the router.
The fix is to manually raise the transmit power to 15 dBm right after esp_wifi_start().
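A minimal sketch of that call follows. It assumes the ESP-IDF function esp_wifi_set_max_tx_power(), whose argument is in units of 0.25 dBm, so 15 dBm becomes 60. The helper names dbm_to_quarter_dbm and raise_tx_power are made up for illustration, and the ESP-IDF part only compiles on target.

```c
#include <stdint.h>

#ifdef ESP_PLATFORM
#include "esp_err.h"
#include "esp_wifi.h"
#endif

/* Convert whole dBm to the 0.25 dBm units the WiFi driver expects.
 * The clamp to 8..84 follows the range given in the ESP-IDF docs
 * (treat the exact bounds as an assumption and check your IDF version). */
int8_t dbm_to_quarter_dbm(int dbm)
{
    int q = dbm * 4;
    if (q < 8) {
        q = 8;
    }
    if (q > 84) {
        q = 84;
    }
    return (int8_t)q;
}

#ifdef ESP_PLATFORM
/* Hypothetical helper: call this once, after esp_wifi_start() has returned. */
esp_err_t raise_tx_power(void)
{
    return esp_wifi_set_max_tx_power(dbm_to_quarter_dbm(15)); /* 15 dBm -> 60 */
}
#endif
```

Returning the esp_err_t instead of wrapping the call in ESP_ERROR_CHECK leaves it to the caller to decide whether a failed power change should be fatal.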
A single line of code, but finding the problem took me most of a day. It’s documented on Seeed’s official wiki, but you need to know the problem exists before you go looking. The ESP-IDF documentation doesn’t mention this hardware-level pitfall at all.
Event loop duplicate creation causing crash loop
There was another pitfall in WiFi module initialization. The device boots into AP mode (for network configuration), then after WiFi is configured it reboots and switches to STA mode. But wifi_init() wrapped esp_event_loop_create_default() in ESP_ERROR_CHECK. If the default event loop had already been created during WiFi scanning, the call returns ESP_ERR_INVALID_STATE and ESP_ERROR_CHECK aborts immediately: the device crashes, reboots, and crashes again in an infinite loop.
The log shows the same ESP_ERR_INVALID_STATE abort on every boot.
The fix is to drop ESP_ERROR_CHECK for this call and handle the return value explicitly.
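One way to write that handling, as a sketch rather than the project’s exact code: treat ESP_ERR_INVALID_STATE from esp_event_loop_create_default() as "already created" and pass every other error up. The wrapper name ensure_default_event_loop is hypothetical, and the #else branch is only a host-side stand-in so the logic can be exercised off-target.

```c
#ifdef ESP_PLATFORM
#include "esp_err.h"
#include "esp_event.h"
#else
/* Host-side stand-ins (assumption: same numeric values as ESP-IDF). */
typedef int esp_err_t;
#define ESP_OK 0
#define ESP_ERR_INVALID_STATE 0x103
static int s_loop_created = 0;
static esp_err_t esp_event_loop_create_default(void)
{
    if (s_loop_created) {
        return ESP_ERR_INVALID_STATE; /* second call: loop already exists */
    }
    s_loop_created = 1;
    return ESP_OK;
}
#endif

/* Returns ESP_OK when the default event loop exists after the call,
 * no matter whether this call created it or an earlier one did. */
esp_err_t ensure_default_event_loop(void)
{
    esp_err_t err = esp_event_loop_create_default();
    if (err == ESP_ERR_INVALID_STATE) {
        return ESP_OK; /* already created elsewhere, nothing to do */
    }
    return err; /* ESP_OK, or a real failure for the caller to log */
}
```

Real failures such as out-of-memory still reach the caller, so only the harmless "already exists" case is swallowed.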
If the call returns ESP_ERR_INVALID_STATE, just ignore it: the event loop already exists, so there is no need to crash. This essentially came down to my insufficient understanding of the ESP-IDF event model; it is easy to miss that this function can’t be called twice.
Other WiFi Notes
ESP32-S3 only supports 2.4 GHz — it won’t find 5 GHz router signals. This isn’t a bug but worth knowing. Also, AP/STA dual-mode design requires careful state management — you can’t access the external network in AP mode, and you can’t start SNTP time sync while STA isn’t connected. My approach uses a 19-step startup sequence: after WiFi init, the subsequent steps depend on the current mode, and time sync and recording start only after STA connects successfully.
The WiFi mode switching flow looks like this:
flowchart TD
A["Power on"] --> B@{shape: diam, label: "WiFi config exists?"}
B -->|"No"| C["Enter AP hotspot mode"]
C --> D["Phone config at 192.168.4.1"]
D --> E["Save credentials and reboot"]
B -->|"Yes"| F["STA connect to router"]
E --> F
F --> G@{shape: diam, label: "Connected?"}
G -->|"Yes"| H["Sync time - Start recording"]
G -->|"No"| F
classDef decision fill:#fff3e0,stroke:#ff9800,stroke-width:2px
classDef startup fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
classDef config fill:#e8f5e9,stroke:#4caf50,stroke-width:2px
classDef running fill:#f3e5f5,stroke:#9c27b0,stroke-width:2px
class B,G decision
class A startup
class C,D,E config
class F,H running
TF Card: Mounting is Just the First Step
TF card issues were more subtle than WiFi. When WiFi doesn’t connect, you can at least see it in the logs. TF card problems sometimes appear to “work normally” when the data is actually corrupt.
SPI high-speed mode initialization failure
The XIAO ESP32-S3 Sense TF card slot is wired for SPI mode (a single data line in each direction), not 4-bit SDMMC mode. This means the speed ceiling is inherently lower than standard SDMMC. I initially set the SPI clock to SDMMC_FREQ_HIGHSPEED (40MHz), thinking faster is better.
Result: many TF cards failed to initialize.
After extensive investigation, I found that the CMD6 high-speed mode switch command fails under 40MHz clock in SPI mode — not all TF cards support high-speed SPI negotiation. Some cards pass, some don’t, and even the same card model from a different batch might fail.
I settled on 20MHz (SDMMC_FREQ_DEFAULT) for reliable operation.
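As a sketch of where that setting lives, assuming the ESP-IDF SDSPI host driver: the clock is the max_freq_khz field of sdmmc_host_t. The function names sd_spi_clock_khz and make_sd_host are illustrative, and the fallback #define values only exist so the snippet compiles off-target.

```c
#ifdef ESP_PLATFORM
#include "driver/sdspi_host.h"
#else
/* Host-side stand-ins for the ESP-IDF constants (values in kHz). */
#define SDMMC_FREQ_DEFAULT   20000
#define SDMMC_FREQ_HIGHSPEED 40000
#endif

/* Single place that decides the SPI clock for the TF card.
 * 20 MHz is the conservative choice; 40 MHz failed on some cards. */
int sd_spi_clock_khz(void)
{
    return SDMMC_FREQ_DEFAULT;
}

#ifdef ESP_PLATFORM
/* Hypothetical helper: build the host descriptor passed to the mount call. */
sdmmc_host_t make_sd_host(void)
{
    sdmmc_host_t host = SDSPI_HOST_DEFAULT();
    host.max_freq_khz = sd_spi_clock_khz();
    return host;
}
#endif
```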
The lesson: SPI mode speed capability differs from SDMMC (4-bit) mode. Parameter changes must be validated on actual hardware. Something that compiles fine may be completely broken at runtime.
TF Card Format and Compatibility
The format must be FAT32 — exFAT and NTFS are not supported. This is common knowledge in embedded development, but if your TF card is 64GB or larger, the factory format is likely exFAT and needs manual reformatting. Windows right-click format may not show FAT32 as an option for cards over 32GB — you’ll need third-party tools.
For speed rating, Class 10 or above is mandatory. Recording write speed must keep up, or frames will be dropped. Stick with known brands — generic card read/write speeds fluctuate too much.
Zero-byte Recording Files
This issue plagued me for a while. The file manager would show 0-byte .avi files. While they didn’t affect functionality, they were annoying to look at.
The root cause was that the segmented recording callback didn’t check the actual bytes written. When the SD card had issues, a segment file would be opened and immediately closed, registering a 0-byte entry in the file cache.
The fix was simple: check the number of bytes written before invoking the callback, and skip empty segments.
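A sketch of such a guard, under assumptions: the names finish_segment, segment_done_cb and count_segment are hypothetical, and deleting the empty file with remove() is my addition rather than something the original fix is known to do.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

typedef void (*segment_done_cb)(const char *path, size_t bytes);

/* Example callback: just counts how many segments were handed on. */
static size_t g_queued = 0;
void count_segment(const char *path, size_t bytes)
{
    (void)path;
    (void)bytes;
    g_queued++;
}

/* Called when a segment is closed. Returns true if the segment was
 * reported to the callback, false if it was empty and dropped. */
bool finish_segment(const char *path, size_t bytes_written, segment_done_cb cb)
{
    if (bytes_written == 0) {
        remove(path); /* assumption: drop the empty file; failure is ignored */
        return false; /* never reaches the file cache or upload queue */
    }
    if (cb != NULL) {
        cb(path, bytes_written);
    }
    return true;
}
```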
Hot-plug Detection
TF card hot-plugging is a real requirement in daily use. My solution runs a monitoring task on Core 1 that polls SD card status every 10 seconds. When the card is removed, recording stops and the LED shows an error state. When a card is inserted, it remounts and automatically resumes recording. The key is that after detecting removal, you can’t immediately try to remount — you must wait for an insertion signal; otherwise it enters a frantic retry loop.
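The polling logic described above can be sketched as a two-state machine. This is an illustration under assumptions: the type and function names are invented, card detection is reduced to a boolean, and the remount is assumed to succeed.

```c
#include <stdbool.h>

typedef enum { SD_MOUNTED, SD_REMOVED } sd_state_t;

typedef enum {
    SD_ACT_NONE,               /* nothing changed since the last poll */
    SD_ACT_STOP_RECORDING,     /* card vanished: stop and show the LED error */
    SD_ACT_REMOUNT_AND_RECORD  /* card is back: remount and resume recording */
} sd_action_t;

static sd_state_t s_sd_state = SD_MOUNTED;

/* Run once per poll interval (10 seconds in the article).
 * Returns what the monitoring task should do next. */
sd_action_t sd_monitor_poll(bool card_present)
{
    switch (s_sd_state) {
    case SD_MOUNTED:
        if (!card_present) {
            s_sd_state = SD_REMOVED;
            return SD_ACT_STOP_RECORDING;
        }
        return SD_ACT_NONE;
    case SD_REMOVED:
        if (card_present) {
            s_sd_state = SD_MOUNTED;
            return SD_ACT_REMOUNT_AND_RECORD;
        }
        /* Still no card: keep waiting, do not attempt a remount. */
        return SD_ACT_NONE;
    }
    return SD_ACT_NONE;
}
```

Because a remount is only requested on the removed-to-present transition, a missing card costs one cheap check per poll instead of a retry storm.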
The entire hot-plug state machine looks like this:
flowchart TD
A["Power on - Init TF card"] --> B["Mounted"]
B --> C["Auto start recording<br/>5-min segment write"]
C --> E["TF card removed detected<br/>LED double-blink error"]
E --> G["TF card inserted detected"]
G --> H["Remount successful"]
H --> C
classDef mounted fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
classDef recording fill:#e8f5e9,stroke:#4caf50,stroke-width:2px
classDef removed fill:#ffebee,stroke:#f44336,stroke-width:2px
classDef inserted fill:#f3e5f5,stroke:#9c27b0,stroke-width:2px
class A,B,H mounted
class C recording
class E removed
class G inserted
Video Output: Recording ≠ Playable
Video output was the most headache-inducing part. The camera could capture frames, write files, and file sizes looked normal — but when downloaded and played, all sorts of errors appeared.
AVI Header Offset Errors
This was the most classic bug. Recorded AVI files showed “0 frames” in media players — the file appeared to have data but simply wouldn’t play.
The root cause was incorrect backfill offsets in close_segment() for the AVI header:
- dwTotalFrames was written at offset 40, but should be at 48
- strh dwLength was written at offset 136, but should be at 140
Off by a few bytes meant the player read frame count and duration as 0, thinking the file contained no video data.
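The correct calculation: the avih data starts at byte 32 (12-byte RIFF header, 12-byte LIST hdrl header, 8-byte avih chunk header), and dwTotalFrames is its fifth 32-bit field, so it sits at 32 + 16 = 48. The strh data starts at byte 108 (32 + 56 bytes of avih data + 12-byte LIST strl header + 8-byte strh chunk header), and dwLength is 32 bytes into it, which gives 140.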
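The offsets can also be derived by the compiler instead of by hand. The sketch below assumes the common layout where avih is the first chunk in hdrl and the video strl follows immediately, which is consistent with the 48 and 140 above. The struct, constant, and function names are illustrative, not the project’s.

```c
#include <stddef.h>
#include <stdint.h>

/* 'avih' chunk data (main AVI header), 56 bytes. */
typedef struct {
    uint32_t dwMicroSecPerFrame, dwMaxBytesPerSec, dwPaddingGranularity;
    uint32_t dwFlags, dwTotalFrames, dwInitialFrames, dwStreams;
    uint32_t dwSuggestedBufferSize, dwWidth, dwHeight, dwReserved[4];
} avi_main_header_t;

/* 'strh' chunk data (stream header), 56 bytes. */
typedef struct {
    uint32_t fccType, fccHandler, dwFlags;
    uint16_t wPriority, wLanguage;
    uint32_t dwInitialFrames, dwScale, dwRate, dwStart, dwLength;
    uint32_t dwSuggestedBufferSize, dwQuality, dwSampleSize;
    int16_t  rcFrame[4];
} avi_stream_header_t;

_Static_assert(sizeof(avi_main_header_t) == 56, "avih must be 56 bytes");
_Static_assert(sizeof(avi_stream_header_t) == 56, "strh must be 56 bytes");

enum {
    RIFF_HDR  = 12, /* "RIFF" + size + "AVI " */
    LIST_HDR  = 12, /* "LIST" + size + list type */
    CHUNK_HDR = 8,  /* fourcc + size */
    AVIH_DATA_OFF = RIFF_HDR + LIST_HDR + CHUNK_HDR,
    STRH_DATA_OFF = AVIH_DATA_OFF + sizeof(avi_main_header_t)
                    + LIST_HDR + CHUNK_HDR,
    AVI_OFF_TOTAL_FRAMES = AVIH_DATA_OFF
                    + offsetof(avi_main_header_t, dwTotalFrames),
    AVI_OFF_STRH_LENGTH  = STRH_DATA_OFF
                    + offsetof(avi_stream_header_t, dwLength)
};

/* AVI fields are little-endian regardless of the host CPU. */
static void put_le32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)v;
    p[1] = (uint8_t)(v >> 8);
    p[2] = (uint8_t)(v >> 16);
    p[3] = (uint8_t)(v >> 24);
}

/* Patch both frame counters in an in-memory copy of the file header.
 * Assumes one stream sample per video frame. */
void avi_backfill_frame_count(uint8_t *hdr, uint32_t frames)
{
    put_le32(hdr + AVI_OFF_TOTAL_FRAMES, frames);
    put_le32(hdr + AVI_OFF_STRH_LENGTH, frames);
}
```

With the offsets tied to the struct definitions, adding or reordering a field shows up as a changed constant or a failed static assertion rather than as an unplayable file.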
The AVI file format is a RIFF container where each layer has a fixed-size header. When manually constructing AVI files, every field’s offset must be precisely calculated. Being off by a single byte corrupts the entire file. During debugging, I used a hex editor to byte-by-byte compare against the AVI specification — a truly memorable experience.
Missing biSize Field Causing “Codec Not Supported”
This issue was even more insidious. VLC reported “Codec not supported: VLC cannot decode the format,” and file properties showed absurd video dimensions (like 600x1572865).
Using VLC’s detailed log mode (-vvv --file-logging), I found that the biWidth and biHeight values were completely wrong, and the biCompression field wasn’t “MJPG” but garbage data. But the code clearly wrote the correct values, so why were they wrong when read back?
After investigation, the root cause was that the write_hdrl() function, when generating AVI’s BITMAPINFOHEADER (strf chunk), omitted the biSize field. This field is a constant value of 40, indicating the header structure size. The strf chunk declared 40 bytes of data but only wrote 36 bytes, causing all subsequent fields to be offset by 4 bytes:
| Field (offset) | Expected | What the player read |
|---|---|---|
| biSize (0) | 40 | 800 (the biWidth value) |
| biWidth (4) | 800 | 600 (the biHeight value) |
| biHeight (8) | 600 | 1572865 (biPlanes = 1 and biBitCount = 24 read as one 32-bit value) |
| biCompression (16) | “MJPG” | garbage (the biSizeImage bytes) |
The fix is to write the missing biSize (40) as the first field of the strf data.
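A sketch of a strf writer with biSize in place, serialized field by field so the byte count is easy to audit. The function name write_strf_mjpeg and the choice of biPlanes = 1, biBitCount = 24 and biSizeImage = width × height × 3 are assumptions for illustration, not the project’s confirmed values.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static void put_le16(uint8_t *p, uint16_t v)
{
    p[0] = (uint8_t)v;
    p[1] = (uint8_t)(v >> 8);
}

static void put_le32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)v;
    p[1] = (uint8_t)(v >> 8);
    p[2] = (uint8_t)(v >> 16);
    p[3] = (uint8_t)(v >> 24);
}

uint32_t get_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

enum { BIH_SIZE = 40 }; /* sizeof(BITMAPINFOHEADER) */

/* Writes the 'strf' chunk (8-byte chunk header + 40-byte BITMAPINFOHEADER)
 * into out, which must hold at least 48 bytes. Returns bytes written. */
size_t write_strf_mjpeg(uint8_t *out, uint32_t width, uint32_t height)
{
    uint8_t *p = out;
    memcpy(p, "strf", 4);            p += 4;
    put_le32(p, BIH_SIZE);           p += 4;  /* chunk data size */
    put_le32(p, BIH_SIZE);           p += 4;  /* biSize: the missing field */
    put_le32(p, width);              p += 4;  /* biWidth */
    put_le32(p, height);             p += 4;  /* biHeight */
    put_le16(p, 1);                  p += 2;  /* biPlanes */
    put_le16(p, 24);                 p += 2;  /* biBitCount */
    memcpy(p, "MJPG", 4);            p += 4;  /* biCompression */
    put_le32(p, width * height * 3); p += 4;  /* biSizeImage */
    memset(p, 0, 16);                p += 16; /* remaining four fields */
    return (size_t)(p - out);
}
```

Checking the return value against 8 + 40 in a test catches a dropped field before any player ever sees the file.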
biSize seems useless (always 40), but the player relies on it to determine the header size and jump to the correct position to read subsequent data. Without those 4 bytes, the entire file structure parsing is wrong.
This bug taught me one thing: there are no “optional” fields in BITMAPINFOHEADER — every field is read by the player. Missing one causes complete failure. When debugging AVI file structure issues, start by verifying every field in the strf chunk, not just the RIFF/LIST hierarchy level.
Can’t Download Files After Stopping Recording
Here’s another annoying issue: trying to download a file immediately after stopping recording would return a 409 “File is currently being recorded” error. The recording was clearly stopped — why was it still reporting as recording?
The cause was that the s_current_file global variable wasn’t cleared after the recording task exited. The download API checked the current file name via recorder_get_current_file(), which always returned the previous file path, causing closed files to be falsely identified as currently recording.
The fix was a single line in the recording task cleanup code that clears s_current_file.
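A sketch of how the pieces fit together, under assumptions: s_current_file and recorder_get_current_file() come from the text above, but their types, the buffer size, the other function names, and the example path are invented for illustration.

```c
#include <stdio.h>
#include <string.h>

static char s_current_file[128]; /* empty string means "not recording" */

/* Called by the recording task when it opens a new segment. */
void recorder_set_current_file(const char *path)
{
    snprintf(s_current_file, sizeof(s_current_file), "%s", path);
}

/* Returns the file being written, or NULL when nothing is recording. */
const char *recorder_get_current_file(void)
{
    return s_current_file[0] != '\0' ? s_current_file : NULL;
}

/* Runs when the recording task exits. */
void recorder_task_cleanup(void)
{
    s_current_file[0] = '\0'; /* the one-line fix: forget the last file */
}

/* HTTP status the download handler would pick for a given path. */
int download_status_for(const char *path)
{
    const char *current = recorder_get_current_file();
    if (current != NULL && strcmp(current, path) == 0) {
        return 409; /* file is currently being recorded */
    }
    return 200;
}
```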
MJPEG Real-time Stream Bandwidth Bottleneck
The main limitation for real-time video streaming is bandwidth. The ESP32-S3’s WiFi throughput is limited — an SVGA (800×600) 10fps MJPEG stream would visibly stutter under poor signal conditions. My approach limits concurrent clients to a maximum of 2; exceeding that returns a 503 error. In practice, dropping the resolution to VGA (640×480) or increasing the JPEG quality parameter (higher value = more compression, worse quality but less bandwidth) makes streaming much smoother in average WiFi signal conditions.
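The two-client cap can be sketched as a small admission counter. The names below are hypothetical, and C11 atomics are my choice for keeping the count consistent if several HTTP handler tasks run at once; the original firmware may do this differently.

```c
#include <stdatomic.h>

#define MAX_STREAM_CLIENTS 2

static atomic_int s_stream_clients = 0;

/* Call when a client requests the MJPEG stream.
 * Returns 200 if a slot was taken, 503 if the stream is full. */
int stream_client_enter(void)
{
    int current = atomic_load(&s_stream_clients);
    while (current < MAX_STREAM_CLIENTS) {
        /* On failure, current is reloaded with the latest value. */
        if (atomic_compare_exchange_weak(&s_stream_clients, &current,
                                         current + 1)) {
            return 200;
        }
    }
    return 503;
}

/* Call exactly once for every client that got a 200, when it disconnects. */
void stream_client_leave(void)
{
    atomic_fetch_sub(&s_stream_clients, 1);
}
```

The important part in practice is pairing every successful enter with a leave, including on socket errors, or slots leak until the next reboot.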
Recording Data Flow
A complete recording, from capture to upload, follows this path:
flowchart TD
A["Camera captures frame"] --> B["Write to AVI file"]
B --> C@{shape: diam, label: "Segment duration reached?"}
C -->|"No"| A
C -->|"Yes"| D["Backfill AVI header"]
D --> E["Callback: add to upload queue"]
E --> F["Upload to NAS"]
F --> G@{shape: diam, label: "Free space < 20%?"}
G -->|"Yes"| H["Delete oldest recording"]
G -->|"No"| A
H --> A
classDef capture fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
classDef process fill:#e8f5e9,stroke:#4caf50,stroke-width:2px
classDef decision fill:#fff3e0,stroke:#ff9800,stroke-width:2px
classDef upload fill:#f3e5f5,stroke:#9c27b0,stroke-width:2px
classDef delete fill:#ffebee,stroke:#f44336,stroke-width:2px
class A capture
class B,D,E process
class C,G decision
class F upload
class H delete
Overall Architecture
The system is based on FreeRTOS with a dual-core division of labor: Core 0 handles real-time tasks to ensure no frame loss, Core 1 handles non-real-time tasks:
flowchart TD
A["Camera captures frame"] --> B@{shape: hex, label: "PSRAM double buffer"}
B --> C["Recording task Core0"]
C --> D@{shape: doc, label: "TF card AVI segments"}
D --> E["NAS upload Core1"]
E --> F["FTP / WebDAV"]
B --> G["MJPEG stream"]
G --> H["Browser :80"]
classDef capture fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
classDef recording fill:#e8f5e9,stroke:#4caf50,stroke-width:2px
classDef upload fill:#f3e5f5,stroke:#9c27b0,stroke-width:2px
classDef streaming fill:#fff3e0,stroke:#ff9800,stroke-width:2px
class A,B capture
class C,D recording
class E,F upload
class G,H streaming
The core design is a dual-core division of labor: Core 0 runs the recording task (high priority, so frame capture isn’t preempted and frames aren’t lost), and Core 1 handles upload, SD monitoring, health checks, and other non-real-time tasks. Camera frame buffers use PSRAM double buffering, which wouldn’t be feasible without PSRAM.
The startup sequence is a 19-step initialization in app_main(). Critical step failures log errors but continue execution (graceful degradation) rather than crashing. TF card hot-plug detection polls every 10 seconds, and loop storage deletes the oldest recordings when free space drops below 20%. A watchdog timer is set to 30 seconds — if it truly hangs, a panic restart occurs.
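The loop-storage trigger mentioned above reduces to one comparison. A minimal sketch, assuming the free and total sizes are available in bytes; the function name and the integer-only formulation are mine.

```c
#include <stdbool.h>
#include <stdint.h>

#define MIN_FREE_PERCENT 20

/* True when free space has dropped below 20% of the card and the
 * oldest recording should be deleted. Integer math avoids floats. */
bool storage_needs_cleanup(uint64_t free_bytes, uint64_t total_bytes)
{
    if (total_bytes == 0) {
        return false; /* no card mounted, nothing to clean */
    }
    return free_bytes * 100 < total_bytes * MIN_FREE_PERCENT;
}
```

Calling this in a loop after each deletion, until it returns false, handles the case where one deleted segment is not enough to get back above the threshold.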
The Web management interface has 4 pages: dashboard for status, configuration page for parameters, file manager for browsing/downloading/deleting recordings, and preview page for real-time streaming. All configuration is persisted via NVS, with optional TF card config file override (for batch deployment).
Closing Thoughts
Looking back, the hardest part wasn’t writing feature code — it was debugging those “the code looks fine but it doesn’t work” bugs. Low WiFi transmit power is a hardware issue, TF card SPI high-speed mode instability is a protocol compatibility issue, AVI header offset errors are a binary file format issue — none of these could be found by reading code. They all required running on actual hardware, using logs and tools to investigate step by step.
A few lessons I found particularly valuable:
- Manually constructing AVI/RIFF file formats is painful. Every field’s offset must be precisely calculated. After writing, use a hex editor to verify against the specification byte by byte — don’t wait for the media player to complain.
- SPI mode TF card speed is limited. Don’t assume 40MHz will work; 20MHz stability is more important than anything else.
- XIAO ESP32-S3 WiFi power must be set manually. Otherwise, some routers simply won’t connect. Seeed’s official docs mention this, but it’s easy to overlook.
- VLC’s -vvv --file-logging is a lifesaver for debugging video files. You can see the actual parsed values of every field, which is much faster than guessing.
The project is open source on GitHub: Mi-Bee-Studio/seeed-esp32s3-cam. Firmware can be downloaded from Releases and flashed directly without compiling from source.