XiaoZhi-Compatible ESP32 Voice Assistant Build Checklist
This checklist covers an ESP32-S3 voice-assistant build that follows the XiaoZhi approach: the ESP32 records audio, sends it to a server, plays back TTS, shows status on LEDs, and can control a simple device such as a fan. It is written for bring-up and troubleshooting, not as a claim that the upstream XiaoZhi firmware is authored here.
What this build proves
- The ESP32-S3 boots reliably and connects to WiFi.
- The I2S microphone produces stable audio levels instead of silence or clipped noise.
- The I2S speaker path can play TTS audio loudly enough for a desktop demo.
- The LED matrix shows listening, thinking, speaking, and error states.
- The backend audio path works through `/chat-stream`, with `/chat-voice` as a fallback test path.
- A simple load such as a low-voltage fan can be switched without disrupting audio.
Bench hardware map
Test one module at a time before installing everything in a case. The local bench notes use this mapping:
| Module | Signal | ESP32-S3 pin |
|---|---|---|
| 16×16 RGB LED matrix | DIN | GPIO8 |
| I2S microphone | BCLK | GPIO5 |
| I2S microphone | WS/LRCLK | GPIO4 |
| I2S microphone | SD | GPIO6 |
| I2S amplifier / speaker | BCLK | GPIO14 |
| I2S amplifier / speaker | LRC | GPIO15 |
| I2S amplifier / speaker | DIN | GPIO7 |
| Low-voltage fan driver | S/control | GPIO13 |
If your LED test sketch uses another pin, change the firmware constant before testing. Do not assume every public demo uses the same pinout.
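One way to keep the pinout in a single place is a small header of constants, sketched below. The pin numbers come from the table above; the namespace, the constant names, and the `allDistinct` helper are illustrative and not taken from any upstream firmware. The compile-time check only catches two signals sharing a GPIO, not a wrong pin.

```cpp
#include <array>
#include <cstddef>

// Pin numbers copied from the bench table. Names are hypothetical.
namespace pins {
constexpr int kLedMatrixDin = 8;
constexpr int kMicBclk = 5;
constexpr int kMicWs = 4;
constexpr int kMicSd = 6;
constexpr int kAmpBclk = 14;
constexpr int kAmpLrc = 15;
constexpr int kAmpDin = 7;
constexpr int kFanControl = 13;

constexpr std::array<int, 8> kAll = {kLedMatrixDin, kMicBclk, kMicWs,  kMicSd,
                                     kAmpBclk,      kAmpLrc,  kAmpDin, kFanControl};
}  // namespace pins

// Returns true when no GPIO number appears twice in the list.
constexpr bool allDistinct(const std::array<int, 8>& p) {
  for (std::size_t i = 0; i < p.size(); ++i) {
    for (std::size_t j = i + 1; j < p.size(); ++j) {
      if (p[i] == p[j]) return false;
    }
  }
  return true;
}

// Fails the build if an edit to the constants above creates a collision.
static_assert(allDistinct(pins::kAll), "two signals share a GPIO");
```

If your board uses a different pin for any signal, change the constant here and rebuild rather than editing each test sketch.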
Bring-up order
- Flash a minimal serial test first. Confirm the board name, USB port, baud rate, and reset behavior.
- Run the LED matrix alone. Check the matrix direction, serpentine layout, and brightness limit before adding audio.
- Run the microphone level test. Print RMS or peak values to serial while the room is quiet and while speaking near the mic.
- Run the speaker test. Play a short fixed sample before connecting live TTS.
- Connect WiFi and call the backend health endpoint. Do not debug audio until the network path is stable.
- Test `/chat-stream` with a short utterance. Keep the first recording under a few seconds so failures are easy to isolate.
- Add the fan or other low-voltage output last. Keep the load power separate from USB when current is high, and share ground where the driver requires it.
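Two of the steps above reduce to small pure functions that can be checked on a desktop before they go near hardware. The sketch below assumes a serpentine matrix wired row by row, with even rows running left to right and odd rows right to left, and 16-bit signed microphone samples. Both are assumptions to verify against your own parts. The function and type names are illustrative.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>

// Map an (x, y) coordinate to a strip index on a serpentine matrix.
// Assumed wiring: even rows left to right, odd rows right to left.
std::size_t serpentineIndex(std::size_t x, std::size_t y, std::size_t width) {
  return (y % 2 == 0) ? y * width + x : y * width + (width - 1 - x);
}

struct FrameLevel {
  double rms;  // root mean square of the frame
  int peak;    // largest absolute sample value
};

// Compute RMS and peak for one frame of 16-bit samples.
FrameLevel measureFrame(const std::int16_t* samples, std::size_t count) {
  if (count == 0) return {0.0, 0};
  double sumSquares = 0.0;
  int peak = 0;
  for (std::size_t i = 0; i < count; ++i) {
    const int s = samples[i];  // widen first so -32768 can be negated safely
    sumSquares += static_cast<double>(s) * s;
    const int magnitude = s < 0 ? -s : s;
    if (magnitude > peak) peak = magnitude;
  }
  return {std::sqrt(sumSquares / static_cast<double>(count)), peak};
}
```

If a test pattern appears mirrored on every other row, the wiring assumption in `serpentineIndex` is the first thing to check. For the microphone test, print `rms` and `peak` once per frame: an RMS stuck at zero points to wiring or sample format, and a peak pinned near 32767 points to clipping.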
Wake and recording stability
The most common failure is in the recording logic, not the language model. Use a short noise calibration window at startup, require several consecutive frames above the threshold before recording, keep a small pre-roll buffer so the first syllable is not cut, and mute recording briefly after TTS playback so the device does not hear itself. Track stop reasons such as silence timeout, maximum recording time, manual stop, and backend error.
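The rules in this section can be sketched as a small frame-driven gate. This is a minimal illustration, not the XiaoZhi firmware's logic: the class, the config fields, and every default value are hypothetical, and the level passed in is assumed to be a per-frame RMS or similar measure.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

using Frame = std::vector<std::int16_t>;

enum class StopReason { None, SilenceTimeout, MaxLength, ManualStop, BackendError };

// Illustrative defaults, all counted in audio frames.
struct GateConfig {
  std::size_t calibrationFrames = 25;    // startup frames averaged into the noise floor
  double triggerMargin = 300.0;          // level above the noise floor that counts as speech
  std::size_t framesToStart = 3;         // consecutive active frames needed to start
  std::size_t preRollFrames = 5;         // frames kept from before the trigger run
  std::size_t silenceFramesToStop = 40;  // consecutive quiet frames that end a recording
  std::size_t maxFrames = 250;           // hard cap on recording length
  std::size_t muteFramesAfterTts = 15;   // frames dropped after playback ends
};

class RecordingGate {
 public:
  explicit RecordingGate(const GateConfig& cfg) : cfg_(cfg) {}

  // Call when TTS playback finishes so the device does not record itself.
  void onTtsFinished() { muteLeft_ = cfg_.muteFramesAfterTts; }

  // Report an external stop (button press, backend failure) so it is tracked too.
  StopReason stop(StopReason reason) {
    return recording_ ? finish(reason) : StopReason::None;
  }

  // Feed one frame and its level. Returns a stop reason on the frame that
  // ends a recording, otherwise StopReason::None.
  StopReason feed(const Frame& frame, double level) {
    if (calibrated_ < cfg_.calibrationFrames) {  // startup noise calibration
      noiseSum_ += level;
      ++calibrated_;
      return StopReason::None;
    }
    if (muteLeft_ > 0) {  // post-TTS mute: drop the frame and any partial trigger
      --muteLeft_;
      preRoll_.clear();
      activeRun_ = 0;
      return StopReason::None;
    }
    const bool active = level > threshold();
    if (!recording_) {
      preRoll_.push_back(frame);  // pre-roll also holds the trigger frames
      if (preRoll_.size() > cfg_.preRollFrames + cfg_.framesToStart) preRoll_.pop_front();
      activeRun_ = active ? activeRun_ + 1 : 0;
      if (activeRun_ >= cfg_.framesToStart) {
        recording_ = true;
        recorded_.assign(preRoll_.begin(), preRoll_.end());
        preRoll_.clear();
        quietRun_ = 0;
      }
      return StopReason::None;
    }
    recorded_.push_back(frame);
    quietRun_ = active ? 0 : quietRun_ + 1;
    if (quietRun_ >= cfg_.silenceFramesToStop) return finish(StopReason::SilenceTimeout);
    if (recorded_.size() >= cfg_.maxFrames) return finish(StopReason::MaxLength);
    return StopReason::None;
  }

  double threshold() const {
    const double noiseFloor =
        cfg_.calibrationFrames ? noiseSum_ / static_cast<double>(cfg_.calibrationFrames) : 0.0;
    return noiseFloor + cfg_.triggerMargin;
  }
  bool recording() const { return recording_; }
  StopReason lastStop() const { return lastStop_; }
  const std::vector<Frame>& recorded() const { return recorded_; }

 private:
  StopReason finish(StopReason reason) {
    recording_ = false;
    activeRun_ = quietRun_ = 0;
    lastStop_ = reason;
    return reason;
  }

  GateConfig cfg_;
  double noiseSum_ = 0.0;
  std::size_t calibrated_ = 0, muteLeft_ = 0, activeRun_ = 0, quietRun_ = 0;
  bool recording_ = false;
  StopReason lastStop_ = StopReason::None;
  std::deque<Frame> preRoll_;
  std::vector<Frame> recorded_;
};
```

Logging the value returned by `feed` or `stop` each time a recording ends gives the stop-reason history directly. If wake is too sensitive, `triggerMargin` and `framesToStart` are the two values to raise first.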
Server-side responsibilities
Keep the ESP32 firmware simple. The device should capture audio, stream bytes, play response audio, show state, and switch simple outputs. ASR, TTS, weather tools, model calls, logging, and prompt changes should live on the server so you can improve them without reflashing every board.
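With the heavy work on the server, the device side can stay close to a small state machine that also drives the LED states. The sketch below is one possible shape, under assumptions: the `Idle` state, the event names, and the transition rules are illustrative and do not come from the upstream firmware.

```cpp
// Listening, Thinking, Speaking, and Error match the LED states in this
// checklist. Idle is an added assumption for "waiting for speech".
enum class DeviceState { Idle, Listening, Thinking, Speaking, Error };

// Hypothetical events raised by the audio, network, and playback code.
enum class Event {
  SpeechDetected,
  RecordingStopped,
  ResponseAudioStarted,
  PlaybackFinished,
  BackendError,
  Reset
};

// Pure transition function: easy to test off-device and to log on-device.
DeviceState nextState(DeviceState state, Event event) {
  if (event == Event::BackendError) return DeviceState::Error;  // from any state
  if (event == Event::Reset) return DeviceState::Idle;          // from any state
  switch (state) {
    case DeviceState::Idle:
      return event == Event::SpeechDetected ? DeviceState::Listening : state;
    case DeviceState::Listening:
      return event == Event::RecordingStopped ? DeviceState::Thinking : state;
    case DeviceState::Thinking:
      return event == Event::ResponseAudioStarted ? DeviceState::Speaking : state;
    case DeviceState::Speaking:
      return event == Event::PlaybackFinished ? DeviceState::Idle : state;
    case DeviceState::Error:
      return state;  // stays in Error until Reset
  }
  return state;
}
```

Keeping the transitions in one function means the LED code only needs to react to the returned state, and unexpected events are ignored instead of leaving the device in a mixed state.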
Troubleshooting
- Only noise from the microphone: check BCLK, WS, and SD pins, then verify the microphone voltage and the I2S sample format.
- Audio response is delayed: test the WebSocket endpoint from a browser or desktop client and check server logs before changing firmware.
- Speaker clicks when LEDs update: reduce matrix brightness, shorten peak current paths, and avoid powering the amplifier from a weak USB port.
- Fan switching resets the board: use a proper driver module, separate load power, a shared ground when required, and a flyback path for inductive loads.
- Wake is too sensitive: recalibrate ambient noise, raise the trigger margin, and require more consecutive active frames.
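For the brightness advice above, a rough current budget helps pick a cap before testing. The sketch assumes about 60 mA per pixel at full white, a figure commonly quoted for WS2812B-class pixels, and linear scaling with a 0 to 255 brightness value. Neither is confirmed for the matrix used here, so check your part's datasheet. Under those assumptions a 16×16 matrix could draw roughly 15 A at full white, far beyond a USB port.

```cpp
#include <algorithm>
#include <cstdint>

// Worst-case draw in mA if every pixel shows full white at this brightness.
// maPerPixelFull is an assumed per-pixel figure, not a measured one.
double worstCaseMilliamps(int pixels, std::uint8_t brightness, double maPerPixelFull = 60.0) {
  return pixels * maPerPixelFull * brightness / 255.0;
}

// Largest 0-255 brightness whose worst case stays within the budget.
std::uint8_t brightnessCap(int pixels, double budgetMilliamps, double maPerPixelFull = 60.0) {
  if (pixels <= 0 || maPerPixelFull <= 0.0) return 255;  // nothing to limit
  const double fullScale = pixels * maPerPixelFull;
  const double scaled = 255.0 * budgetMilliamps / fullScale;
  // The cast truncates toward zero, so the result never exceeds the budget.
  return static_cast<std::uint8_t>(std::clamp(scaled, 0.0, 255.0));
}
```

For example, `brightnessCap(256, 1000.0)` gives the cap for a 1 A LED budget on a 256-pixel matrix. Real animations rarely show full white on every pixel, so this is a conservative bound.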
Open-source note
XiaoZhi-compatible builds often reference public firmware and community work. Preserve upstream license files, copyright notices, commit history, and links. If you publish a cleaned repo later, make it clear which parts are upstream, which parts are local wiring or documentation, and which maintenance tasks were AI-assisted.
Related products and references
ESP32-C3 Voice AI Kit / ESP32-S3 Voice Dev Kit / ESP32-S3 AI Voice Display Box
External references: Espressif ESP32-S3 getting started and Arduino documentation.