
Giving Your AI Agent Ears

People send voice messages. Agents read text. This mismatch is surprisingly annoying when you're running AI agents on Telegram.

Here's how we added local voice transcription to our agents using whisper.cpp. No OpenAI API calls, no subscriptions, no cloud dependencies. A compiled binary, a 142MB model file, and a bash script.

The Setup

We run two AI agents on separate VPS instances. Ocean lives on a 16GB Hetzner box (4 cores). Krill lives on a 4GB DigitalOcean droplet (2 cores). Both run Ubuntu 24.04.

The pipeline is simple:

  1. Telegram delivers a voice message as an OGG file
  2. ffmpeg converts it to 16kHz mono WAV
  3. whisper-cli transcribes it locally
  4. The transcript is posted as a quoted reply, then passed to the LLM
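Step 1 might look something like this with the Telegram Bot API. This is a hedged sketch, not our exact code: the `fetch_voice` and `extract_file_path` helpers are hypothetical names, though `getFile` and the `file/bot<token>/` download path are the real Bot API endpoints:

```shell
# Pull the file_path field out of a getFile JSON response.
# (Crude grep/cut parsing; jq is the robust choice if it's installed.)
extract_file_path() {
  grep -o '"file_path":"[^"]*"' | cut -d'"' -f4
}

# Resolve a Telegram file_id to a file_path, then download the OGG.
fetch_voice() {
  local token="$1" file_id="$2" out="$3"
  local path
  path=$(curl -s "https://api.telegram.org/bot${token}/getFile?file_id=${file_id}" \
    | extract_file_path)
  curl -s -o "$out" "https://api.telegram.org/file/bot${token}/${path}"
}
```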

Choosing the Right Model

Whisper comes in several sizes. We learned this the hard way:

ggml-tiny.bin     75MB    ~3s on 4 cores    rough quality
ggml-base.bin    142MB   ~10s on 4 cores    good quality
ggml-small.bin   466MB   ~15s on 4 cores    great quality
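The model files themselves are published in the whisper.cpp project's Hugging Face repository (ggerganov/whisper.cpp). A small helper makes the download scriptable; `model_url` is a hypothetical name, but the URL scheme is the upstream one:

```shell
# Build the download URL for a ggml model file hosted in the
# whisper.cpp Hugging Face repository.
model_url() {
  printf 'https://huggingface.co/ggerganov/whisper.cpp/resolve/main/%s\n' "$1"
}

# Example:
#   curl -L -o ~/.local/share/whisper-cpp/ggml-base.bin "$(model_url ggml-base.bin)"
```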

On the 16GB/4-core box, the small model works fine. On the 4GB/2-core box, it timed out on a 45-second voice message. Switched to base and it completes in ~36 seconds. Quality is good enough for conversation.

The lesson: match your model to your hardware. A fast mediocre transcription beats a perfect one that times out.
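That matching rule is mechanical enough to encode. A sketch, with thresholds that simply mirror what worked on our two boxes (`pick_model` is a hypothetical helper, not part of whisper.cpp):

```shell
# Pick a whisper model size from the CPU core count.
# The cutoffs are just what held up on our hardware.
pick_model() {
  local cores="$1"
  if [ "$cores" -ge 4 ]; then
    echo "ggml-small.bin"   # 4+ cores handled small fine
  else
    echo "ggml-base.bin"    # 2 cores timed out on small; base keeps up
  fi
}

MODEL_FILE=$(pick_model "$(nproc)")
```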

The Wrapper Script

The entire transcription layer is a 15-line bash script:

#!/usr/bin/env bash
set -euo pipefail

MODEL="$HOME/.local/share/whisper-cpp/ggml-base.bin"
INPUT="$1"
TMPWAV=$(mktemp /tmp/whisper-XXXXXX.wav)

trap 'rm -f "$TMPWAV"' EXIT

# Convert any audio to 16kHz mono WAV
ffmpeg -i "$INPUT" -ar 16000 -ac 1 -c:a pcm_s16le \
  "$TMPWAV" -y -loglevel error

# Transcribe
whisper-cli -m "$MODEL" -f "$TMPWAV" --no-timestamps 2>/dev/null

That's it. The wrapper converts the audio (Telegram sends OGG Opus), cleans up its temporary file on exit, and writes the transcript to stdout, where the agent framework picks it up.

The Quote Trick

Raw transcription is useful but not enough. When someone sends a voice message, they want to know what the agent heard. So we added an automatic quoted reply:

🎙️ Hey, can you check the deployment status?

This posts instantly as a Telegram reply to the original voice message, before the agent even starts thinking. The sender gets immediate confirmation their message was understood. If the transcription is wrong, they can correct it before the agent acts on garbage.
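A sketch of that reply via the Bot API's `sendMessage` call. The `build_reply` helper, chat ID, and message ID are hypothetical; `reply_to_message_id` is a real `sendMessage` parameter:

```shell
# Build a sendMessage payload that quotes the original voice message.
build_reply() {
  local chat_id="$1" msg_id="$2" text="$3"
  printf '{"chat_id":%s,"reply_to_message_id":%s,"text":"%s"}' \
    "$chat_id" "$msg_id" "$text"
}

# Example: post the transcript as a quoted reply (BOT_TOKEN assumed set):
#   curl -s -X POST "https://api.telegram.org/bot${BOT_TOKEN}/sendMessage" \
#     -H 'Content-Type: application/json' \
#     -d "$(build_reply 123456 789 "🎙️ Hey, can you check the deployment status?")"
```

Real transcripts need JSON-escaping before being interpolated into the payload; in practice a tool like jq handles that.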

What It Costs

Nothing. Zero ongoing cost. The whisper model runs on CPU, the binary is compiled from source, ffmpeg is a system package. Compare that to the hosted Whisper API at $0.006/minute: at, say, 30 one-minute messages a day, that's about $0.18/day, or roughly $5.40/month, about the price of the droplet itself. For an agent that processes dozens of voice messages daily, the setup time pays for itself in weeks.

The only real cost is ~10-36 seconds of CPU time per message, depending on your hardware and model choice. For async messaging, that's invisible.

Replicating Across Servers

When we needed to add transcription to our second agent, we just tarred the binary + shared libs + model and SCP'd it over. Same Ubuntu version means binary compatibility. Total time from "can't transcribe" to "fully working": 5 minutes.

# Bundle everything
tar czf whisper-bundle.tar.gz \
  /usr/local/bin/whisper-cli \
  /usr/local/lib/libwhisper* \
  /usr/local/lib/libggml* \
  ~/.local/share/whisper-cpp/ggml-base.bin

# Ship it and unpack at / — tar stored the paths relative to root,
# so extracting in /tmp would leave everything under /tmp/usr/local.
# (The model path also assumes the same username on both boxes.)
scp whisper-bundle.tar.gz user@other-server:/tmp/
ssh user@other-server "sudo tar xzf /tmp/whisper-bundle.tar.gz -C / && sudo ldconfig"

The Takeaway

Voice transcription for AI agents is a solved problem. You don't need cloud APIs. You don't need GPU instances. A $5/month VPS with 2 cores and 4GB RAM can transcribe voice messages locally with acceptable quality and latency.

The real insight: your agent's ability to understand voice messages isn't a feature. It's table stakes. People use voice because it's faster than typing. If your agent can't handle that, you're forcing your users to change their behavior for your limitations.

Give your agent ears. It's 142 megabytes and a bash script.