XiaoZhi-Compatible ESP32 Voice Assistant Build Checklist

This checklist covers an ESP32-S3 voice assistant build that follows the XiaoZhi voice-assistant approach: the ESP32 records audio, sends it to a server, plays back TTS, shows status on LEDs, and can control a simple device such as a fan. It is written for bring-up and troubleshooting and does not claim authorship of the upstream XiaoZhi firmware.

What this build proves

Bench hardware map

Test one module at a time before installing everything in a case. The local bench notes use this mapping:

Module                     Signal       ESP32-S3 pin
16×16 RGB LED matrix       DIN          GPIO8
I2S microphone             BCLK         GPIO5
I2S microphone             WS/LRCLK     GPIO4
I2S microphone             SD           GPIO6
I2S amplifier / speaker    BCLK         GPIO14
I2S amplifier / speaker    LRC          GPIO15
I2S amplifier / speaker    DIN          GPIO7
Low-voltage fan driver     S/control    GPIO13

If your LED test sketch uses another pin, change the firmware constant before testing. Do not assume every public demo uses the same pinout.
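One way to keep the pinout in a single place is a small block of named constants, sketched below. The values are copied from the bench mapping above; the namespace, constant names, and the pinsAreUnique() helper are hypothetical and not taken from any upstream firmware.

```cpp
// Bench pin map as named constants (names are illustrative only).
namespace pins {
constexpr int kLedMatrixDin = 8;   // 16x16 RGB LED matrix, DIN
constexpr int kMicBclk      = 5;   // I2S microphone, BCLK
constexpr int kMicWs        = 4;   // I2S microphone, WS/LRCLK
constexpr int kMicSd        = 6;   // I2S microphone, SD
constexpr int kAmpBclk      = 14;  // I2S amplifier, BCLK
constexpr int kAmpLrc       = 15;  // I2S amplifier, LRC
constexpr int kAmpDin       = 7;   // I2S amplifier, DIN
constexpr int kFanControl   = 13;  // low-voltage fan driver, S/control
}  // namespace pins

// Returns true when no GPIO number appears twice in the map.
// Worth calling once at startup after editing any constant above.
bool pinsAreUnique() {
  const int all[] = {pins::kLedMatrixDin, pins::kMicBclk,  pins::kMicWs,
                     pins::kMicSd,        pins::kAmpBclk,  pins::kAmpLrc,
                     pins::kAmpDin,       pins::kFanControl};
  const int count = static_cast<int>(sizeof(all) / sizeof(all[0]));
  for (int i = 0; i < count; ++i) {
    for (int j = i + 1; j < count; ++j) {
      if (all[i] == all[j]) return false;
    }
  }
  return true;
}
```

If a test sketch needs a different pin, changing the one constant and re-running the uniqueness check catches an accidental clash before any module is wired.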

Bring-up order

  1. Flash a minimal serial test first. Confirm the board name, USB port, baud rate, and reset behavior.
  2. Run the LED matrix alone. Check the matrix direction, serpentine layout, and brightness limit before adding audio.
  3. Run the microphone level test. Print RMS or peak values to serial while the room is quiet and while speaking near the mic.
  4. Run the speaker test. Play a short fixed sample before connecting live TTS.
  5. Connect to Wi-Fi and call the backend health endpoint. Do not debug audio until the network path is stable.
  6. Test /chat-stream with a short utterance. Keep the first recording under a few seconds so failures are easy to isolate.
  7. Add the fan or other low-voltage output last. Power the load from a separate supply rather than USB when its current draw is high, and share ground with the ESP32 where the driver requires it.
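Steps 2 and 3 rest on two small calculations that can be checked on a desktop before they go into firmware. The sketch below is a minimal version under stated assumptions: the matrix is 16 wide with row 0 wired left to right and odd rows reversed, and the microphone samples have already been converted to signed 16-bit PCM. The function names are hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

constexpr int kMatrixWidth = 16;  // assumption: 16x16 panel, row-major wiring

// Map (x, y) to the LED's position along the strip on a serpentine panel.
// Even rows run left to right, odd rows run right to left. Flip the
// condition if your panel starts in the other direction.
int serpentineIndex(int x, int y) {
  if (y % 2 == 0) return y * kMatrixWidth + x;
  return y * kMatrixWidth + (kMatrixWidth - 1 - x);
}

// Root-mean-square level of one frame of 16-bit samples.
double frameRms(const int16_t* samples, std::size_t count) {
  if (count == 0) return 0.0;
  double sumSquares = 0.0;
  for (std::size_t i = 0; i < count; ++i) {
    const double s = samples[i];
    sumSquares += s * s;
  }
  return std::sqrt(sumSquares / static_cast<double>(count));
}

// Largest absolute sample value in the frame.
int framePeak(const int16_t* samples, std::size_t count) {
  int peak = 0;
  for (std::size_t i = 0; i < count; ++i) {
    const int magnitude = std::abs(static_cast<int>(samples[i]));
    if (magnitude > peak) peak = magnitude;
  }
  return peak;
}
```

Printing frameRms() and framePeak() once per frame over serial gives the quiet-room and speaking numbers that step 3 asks for. If a lit test pixel appears mirrored on every other row, the serpentine assumption is the first thing to check.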

Wake and recording stability

The most common failure is in the recording logic, not the language model. Use a short noise calibration window at startup, require several consecutive frames above the threshold before recording, keep a small pre-roll buffer so the first syllable is not cut, and mute recording briefly after TTS playback so the device does not hear itself. Track stop reasons such as silence timeout, maximum recording time, manual stop, and backend error.

Server-side responsibilities

Keep the ESP32 firmware simple. The device should capture audio, stream bytes, play response audio, show state, and switch simple outputs. ASR, TTS, weather tools, model calls, logging, and prompt changes should live on the server so you can improve them without reflashing every board.

Troubleshooting

Open-source note

XiaoZhi-compatible builds often reference public firmware and community work. Preserve upstream license files, copyright notices, commit history, and links. If you publish a cleaned repo later, make it clear which parts are upstream, which parts are local wiring or documentation, and which maintenance tasks were AI-assisted.

External references: the Espressif ESP32-S3 getting started guide and the Arduino documentation.
