Giving Your AI Agent Ears
People send voice messages. Agents read text. This mismatch is surprisingly annoying when you're running AI agents on Telegram.
Here's how we added local voice transcription to our agents using whisper.cpp. No OpenAI API calls, no subscriptions, no cloud dependencies. A compiled binary, a 142MB model file, and a bash script.
The Setup
We run two AI agents on separate VPS instances. Ocean lives on a 16GB Hetzner box (4 cores). Krill lives on a 4GB DigitalOcean droplet (2 cores). Both run Ubuntu 24.04.
The pipeline is simple:
- Telegram delivers a voice message as an OGG file
- ffmpeg converts it to 16kHz mono WAV
- whisper-cli transcribes it locally
- The transcript is posted as a quoted reply, then passed to the LLM
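For context on the first step: most agent frameworks download the voice file for you, but if you're wiring it up yourself, Telegram's Bot API does it in two requests (getFile returns a `file_path`, which you then fetch from the file endpoint). A minimal sketch, where `TOKEN`, `FILE_ID`, and `FILE_PATH` are placeholders:

```shell
#!/usr/bin/env bash
# Sketch of fetching the voice OGG from Telegram (placeholders throughout).
file_info_url() { echo "https://api.telegram.org/bot$1/getFile?file_id=$2"; }
download_url()  { echo "https://api.telegram.org/file/bot$1/$2"; }

# Step 1: ask for the file's path on Telegram's servers
# curl -s "$(file_info_url "$TOKEN" "$FILE_ID")"   # JSON with result.file_path

# Step 2: download the OGG itself
# curl -s -o voice.ogg "$(download_url "$TOKEN" "$FILE_PATH")"

file_info_url "123456:ABC" "AgADBAAD"
```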
Choosing the Right Model
Whisper comes in several sizes. We learned this the hard way:
| Model | Size | Time (4 cores) | Quality |
|---|---|---|---|
| ggml-tiny.bin | 75MB | ~3s | rough |
| ggml-base.bin | 142MB | ~10s | good |
| ggml-small.bin | 466MB | ~15s | great |
On the 16GB/4-core box, the small model works fine. On the 4GB/2-core box, small timed out on a 45-second voice message; switching to base brought transcription down to ~36 seconds, with quality that's still good enough for conversation.
The lesson: match your model to your hardware. A fast mediocre transcription beats a perfect one that times out.
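That rule of thumb can be encoded directly. The thresholds below are just what worked on our two boxes, not a benchmark:

```shell
#!/usr/bin/env bash
# Pick a whisper model by core count. Thresholds are our rule of thumb:
# small needs ~4 cores to finish in time; base is the safe fallback.
pick_model() {
  if [ "$1" -ge 4 ]; then
    echo ggml-small.bin
  else
    echo ggml-base.bin
  fi
}

pick_model "$(nproc)"
```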
The Wrapper Script
The entire transcription layer is a short bash script:
```bash
#!/usr/bin/env bash
set -euo pipefail

MODEL="$HOME/.local/share/whisper-cpp/ggml-base.bin"
INPUT="$1"
TMPWAV=$(mktemp /tmp/whisper-XXXXXX.wav)
trap 'rm -f "$TMPWAV"' EXIT

# Convert any audio to 16kHz mono WAV (options go before the output file,
# or ffmpeg warns about trailing options and may ignore them)
ffmpeg -y -loglevel error -i "$INPUT" -ar 16000 -ac 1 -c:a pcm_s16le "$TMPWAV"

# Transcribe
whisper-cli -m "$MODEL" -f "$TMPWAV" --no-timestamps 2>/dev/null
```
That's it. The wrapper handles format conversion (Telegram sends OGG Opus), temporary file cleanup, and pipes the transcript to stdout where the agent framework picks it up.
The Quote Trick
Raw transcription is useful but not enough. When someone sends a voice message, they want to know what the agent heard. So we added an automatic quoted reply:
> 🎙️ Hey, can you check the deployment status?
This posts instantly as a Telegram reply to the original voice message, before the agent even starts thinking. The sender gets immediate confirmation their message was understood. If the transcription is wrong, they can correct it before the agent acts on garbage.
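A minimal sketch of that reply, using the Bot API's `sendMessage` with `reply_to_message_id`. The helper name, token, and IDs are placeholders, and a real implementation should JSON-escape the transcript (e.g. with jq) rather than interpolating it:

```shell
#!/usr/bin/env bash
# Build the quoted-reply payload for Telegram's sendMessage endpoint.
# NOTE: no JSON escaping here -- a real implementation should use jq.
build_reply() {
  printf '{"chat_id": %s, "reply_to_message_id": %s, "text": "🎙️ %s"}' "$1" "$2" "$3"
}

build_reply 12345 678 "Hey, can you check the deployment status?"

# Then post it (TOKEN is a placeholder):
# curl -s -X POST "https://api.telegram.org/bot${TOKEN}/sendMessage" \
#      -H 'Content-Type: application/json' \
#      -d "$(build_reply "$CHAT_ID" "$MSG_ID" "$TRANSCRIPT")"
```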
What It Costs
Nothing. Zero ongoing cost. The whisper model runs on CPU, the binary is compiled from source, ffmpeg is a system package. Compare that to the Whisper API at $0.006/minute. For an agent that processes dozens of voice messages daily, local inference pays for itself in weeks.
The only real cost is ~10-36 seconds of CPU time per message, depending on your hardware and model choice. For async messaging, that's invisible.
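To put a number on the API comparison, here's the arithmetic, assuming a generous ~30 minutes of voice per day (our real volume is lower):

```shell
#!/usr/bin/env bash
# Monthly Whisper API cost for a given daily voice volume, at $0.006/minute.
monthly_api_cost() {
  awk -v mins_per_day="$1" 'BEGIN { printf "%.2f", 0.006 * mins_per_day * 30 }'
}

echo "30 min/day -> \$$(monthly_api_cost 30)/month"
```

At that volume the API bill roughly matches the price of the droplet itself, which is why the break-even comes quickly.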
Replicating Across Servers
When we needed to add transcription to our second agent, we just tarred the binary + shared libs + model and SCP'd it over. Same Ubuntu version means binary compatibility. Total time from "can't transcribe" to "fully working": 5 minutes.
```bash
# Bundle everything (tar strips the leading "/" and warns about it; that's fine)
tar czf whisper-bundle.tar.gz \
  /usr/local/bin/whisper-cli \
  /usr/local/lib/libwhisper* \
  /usr/local/lib/libggml* \
  ~/.local/share/whisper-cpp/ggml-base.bin

# Ship it
scp whisper-bundle.tar.gz user@other-server:/tmp/

# Extract at the root so the paths land back where they started,
# then refresh the shared-library cache
ssh user@other-server "sudo tar xzf /tmp/whisper-bundle.tar.gz -C / && sudo ldconfig"
```

One caveat: `~` expands to your local home directory before tar sees it, so the model lands under the same absolute path on the remote box. Adjust if the usernames differ.
The Takeaway
Voice transcription for AI agents is a solved problem. You don't need cloud APIs. You don't need GPU instances. A $5/month VPS with 2 cores and 4GB RAM can transcribe voice messages locally with acceptable quality and latency.
The real insight: your agent's ability to understand voice messages isn't a feature. It's table stakes. People use voice because it's faster than typing. If your agent can't handle that, you're forcing your users to change their behavior for your limitations.
Give your agent ears. It's 142 megabytes and a bash script.