XiaoZhi-Compatible ESP32 Voice Assistant Build Checklist

This checklist covers an ESP32-S3 voice assistant build that follows the XiaoZhi voice-assistant approach: the ESP32 records audio, sends it to a server, plays back TTS, shows status on LEDs, and can control a simple device such as a fan. It is written for bring-up and troubleshooting, and it does not claim authorship of the upstream XiaoZhi firmware.

What this build proves

The ESP32-S3 boots reliably and connects to Wi-Fi.
The I2S microphone produces stable audio levels instead of silence or clipped noise.
The I2S speaker path can play TTS audio loudly enough for a desktop demo.
The LED matrix shows listening, thinking, speaking, and error states.
The backend audio path works through /chat-stream, with /chat-voice as a fallback test path.
A simple load such as a low-voltage fan can be switched without disrupting audio.

Bench hardware map

Test one module at a time before installing everything in a case. The local bench notes use this mapping:

Module	Signal	ESP32-S3 pin
16×16 RGB LED matrix	DIN	GPIO8
I2S microphone	BCLK	GPIO5
I2S microphone	WS/LRCLK	GPIO4
I2S microphone	SD	GPIO6
I2S amplifier / speaker	BCLK	GPIO14
I2S amplifier / speaker	LRC	GPIO15
I2S amplifier / speaker	DIN	GPIO7
Low-voltage fan driver	S/control	GPIO13

If your LED test sketch uses another pin, change the firmware constant before testing. Do not assume every public demo uses the same pinout.
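
Before flashing, it can help to keep the pin map in one place and check it for accidental duplicates. The following is a minimal host-side Python sketch, not part of any firmware: the GPIO numbers come from the table above, while the dictionary keys and the helper name find_pin_conflicts are hypothetical names chosen for illustration.

```python
# Bench pin map from the table above, keyed by (module, signal).
# The key names are illustrative; only the GPIO numbers come from the bench notes.
PIN_MAP = {
    ("led_matrix", "DIN"): 8,
    ("i2s_mic", "BCLK"): 5,
    ("i2s_mic", "WS"): 4,
    ("i2s_mic", "SD"): 6,
    ("i2s_amp", "BCLK"): 14,
    ("i2s_amp", "LRC"): 15,
    ("i2s_amp", "DIN"): 7,
    ("fan_driver", "S"): 13,
}


def find_pin_conflicts(pin_map):
    """Return {gpio: [signals...]} for every GPIO assigned more than once."""
    by_gpio = {}
    for signal, gpio in pin_map.items():
        by_gpio.setdefault(gpio, []).append(signal)
    return {gpio: sigs for gpio, sigs in by_gpio.items() if len(sigs) > 1}


if __name__ == "__main__":
    conflicts = find_pin_conflicts(PIN_MAP)
    print("conflicts:", conflicts or "none")
```

If you move the LED matrix or any other module to a different pin, update the map first and rerun the check, then change the matching firmware constant.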

Bring-up order

Flash a minimal serial test first. Confirm the board name, USB port, baud rate, and reset behavior.
Run the LED matrix alone. Check the matrix direction, serpentine layout, and brightness limit before adding audio.
Run the microphone level test. Print RMS or peak values to serial while the room is quiet and while speaking near the mic.
Run the speaker test. Play a short fixed sample before connecting live TTS.
Connect to Wi-Fi and call the backend health endpoint. Do not debug audio until the network path is stable.
Test /chat-stream with a short utterance. Keep the first recording to a few seconds so failures are easy to isolate.
Add the fan or other low-voltage output last. Keep the load power separate from USB when current is high, and share ground where the driver requires it.
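
Two of the steps above involve small calculations that are easy to get wrong: mapping matrix coordinates to a serpentine strip index, and computing RMS and peak levels for the microphone test. The sketch below shows both in plain Python so the logic can be checked on a desktop before porting it to firmware. It assumes pixel 0 sits at the top-left with even rows running left to right, and that audio arrives as signed 16-bit PCM samples; neither assumption comes from the bench notes, so verify both on your hardware.

```python
import math


def serpentine_index(x, y, width=16):
    """Map (x, y) to a strip index on a row-wise serpentine matrix.

    Assumption: pixel 0 is top-left and even rows run left to right.
    Panels with another start corner or column-wise wiring need a different map.
    """
    if y % 2 == 0:
        return y * width + x
    return y * width + (width - 1 - x)


def frame_levels(samples):
    """Return (rms, peak) for one frame of signed 16-bit PCM samples."""
    if not samples:
        return 0.0, 0
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    peak = max(abs(s) for s in samples)
    return rms, peak
```

For the microphone test, a quiet room should give a low, steady RMS, and speech near the mic should raise it clearly. A peak pinned near 32767 suggests clipping, and an RMS of exactly zero suggests a wiring or sample-format problem.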

Wake and recording stability

The most common failure is not the language model. It is recording logic. Use a short noise calibration window at startup, require several consecutive frames above the threshold before recording, keep a small pre-roll buffer so the first syllable is not cut, and mute recording briefly after TTS playback so the device does not hear itself. Track stop reasons such as silence timeout, maximum recording time, manual stop, and backend error.
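
The recording rules above can be sketched as a small state machine. This is a host-side Python model under stated assumptions, not the upstream firmware's logic: the class name RecordingGate, the function calibrate_noise_floor, and every default (margin, frame counts) are hypothetical values for illustration, and frame duration is left to the firmware.

```python
from collections import deque
from enum import Enum


class StopReason(Enum):
    SILENCE_TIMEOUT = "silence_timeout"
    MAX_DURATION = "max_duration"
    MANUAL_STOP = "manual_stop"
    BACKEND_ERROR = "backend_error"


def calibrate_noise_floor(quiet_rms_values):
    """Average RMS over a short quiet window captured at startup."""
    return sum(quiet_rms_values) / len(quiet_rms_values)


class RecordingGate:
    """Decides when to start and stop recording from per-frame RMS values."""

    def __init__(self, noise_floor, margin=2.0, trigger_frames=3,
                 preroll_frames=5, silence_frames=25, max_frames=250,
                 mute_frames=10):
        self.threshold = noise_floor * margin
        self.trigger_frames = trigger_frames
        self.silence_frames = silence_frames
        self.max_frames = max_frames
        self.mute_frames = mute_frames
        self.preroll = deque(maxlen=preroll_frames)  # keeps the first syllable
        self.recording = []
        self.active = False
        self.loud_run = 0
        self.quiet_run = 0
        self.mute_left = 0

    def on_tts_finished(self):
        """Ignore the next mute_frames frames so the device does not hear itself."""
        self.mute_left = self.mute_frames
        self.loud_run = 0
        self.preroll.clear()

    def feed(self, frame, rms):
        """Process one frame; return a StopReason when a recording ends, else None."""
        if self.mute_left > 0:
            self.mute_left -= 1
            return None
        loud = rms > self.threshold
        if not self.active:
            self.preroll.append(frame)
            self.loud_run = self.loud_run + 1 if loud else 0
            if self.loud_run >= self.trigger_frames:
                # Start recording and seed it with the pre-roll buffer.
                self.active = True
                self.recording = list(self.preroll)
                self.preroll.clear()
                self.quiet_run = 0
            return None
        self.recording.append(frame)
        self.quiet_run = 0 if loud else self.quiet_run + 1
        if self.quiet_run >= self.silence_frames:
            return self.stop(StopReason.SILENCE_TIMEOUT)
        if len(self.recording) >= self.max_frames:
            return self.stop(StopReason.MAX_DURATION)
        return None

    def stop(self, reason):
        """End the current recording; also used for manual stop or backend error."""
        self.active = False
        self.loud_run = 0
        self.quiet_run = 0
        return reason
```

Logging the returned StopReason for every recording makes it much easier to tell a threshold problem (frequent silence timeouts or maximum-duration stops) from a network problem (backend errors).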

Server-side responsibilities

Keep the ESP32 firmware simple. The device should capture audio, stream bytes, play response audio, show state, and switch simple outputs. ASR, TTS, weather tools, model calls, logging, and prompt changes should live on the server so you can improve them without reflashing every board.

Troubleshooting

Only noise from the microphone: check BCLK, WS, and SD pins, then verify the microphone voltage and the I2S sample format.
Audio response is delayed: test the WebSocket endpoint from a browser or desktop client and check server logs before changing firmware.
Speaker clicks when LEDs update: reduce matrix brightness, shorten peak current paths, and avoid powering the amplifier from a weak USB port.
Fan switching resets the board: use a proper driver module, separate load power, a shared ground when required, and a flyback path for inductive loads.
Wake is too sensitive: recalibrate ambient noise, raise the trigger margin, and require more consecutive active frames.
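
For the speaker-click item, a rough power budget shows why brightness limits matter on a 16×16 matrix. The sketch below is back-of-envelope arithmetic only. It assumes about 60 mA per pixel at full white, a figure commonly quoted for WS2812-class LEDs; the bench notes do not name the LED type, so check your panel's datasheet. It also treats current as scaling linearly with brightness, which is an approximation.

```python
def matrix_peak_current_a(pixels=256, ma_per_pixel=60.0, brightness=1.0):
    """Worst-case current in amps with every pixel at full white.

    ma_per_pixel is an assumed figure, not a measured one.
    brightness is a 0..1 scale applied linearly (an approximation).
    """
    return pixels * ma_per_pixel * brightness / 1000.0


def max_brightness_for_budget(budget_a, pixels=256, ma_per_pixel=60.0):
    """Largest brightness scale (0..1) that keeps the worst case within budget_a."""
    full = matrix_peak_current_a(pixels, ma_per_pixel, 1.0)
    return min(1.0, budget_a / full)


if __name__ == "__main__":
    print("full white:", matrix_peak_current_a(), "A")
    print("scale for a 0.5 A budget:", round(max_brightness_for_budget(0.5), 3))
```

Under these assumptions the full-white worst case is far above what a USB port can supply, so a firmware brightness cap and a separate supply for the matrix are both worth considering. Status animations rarely light every pixel at full white, so real draw is usually much lower than this bound.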

Open-source note

XiaoZhi-compatible builds often reference public firmware and community work. Preserve upstream license files, copyright notices, commit history, and links. If you publish a cleaned repo later, make it clear which parts are upstream, which parts are local wiring or documentation, and which maintenance tasks were AI-assisted.

External references: Espressif ESP32-S3 getting started and Arduino documentation.