Voice (TTS & STT)

OpenCrabs supports text-to-speech and speech-to-text with five provider tiers: Off, Groq (API), OpenAI-compatible (any /v1/audio endpoint), Voicebox (self-hosted), or Local (on-device, zero cost).

Quick Setup

Run /onboard:voice in the TUI to configure everything interactively. The voice screen has radio selectors for both STT and TTS, with fields shown/hidden based on the selected provider. API keys are wired to keys.toml automatically.

Speech-to-Text (STT)

Providers

Provider	Engine	Cost	Latency	Setup
Groq	Whisper (`whisper-large-v3-turbo`)	Per-minute pricing	~1s	API key in `keys.toml`
OpenAI-compatible	Any Whisper-compatible endpoint	Varies	~1-3s	`stt_base_url` + `stt_model` + API key
Voicebox	Self-hosted open-source	Free	~2-5s	`voicebox_stt_enabled=true` + `voicebox_stt_base_url`
Local	whisper.cpp (on-device)	Free	~2-5s	Auto-downloads model

Local STT Models

Model	Size	Quality	Speed
`local-tiny`	~75 MB	Good for short messages	Fastest
`local-base`	~142 MB	Better accuracy	Fast
`local-small`	~466 MB	High accuracy	Moderate
`local-medium`	~1.5 GB	Best accuracy	Slower

Models auto-download from HuggingFace to ~/.local/share/opencrabs/models/whisper/ on first use.

Configuration

# config.toml
[voice]
stt_enabled = true
stt_mode = "local"              # "api" or "local"
local_stt_model = "local-tiny"  # local-tiny, local-base, local-small, local-medium

For API mode:

# keys.toml
[providers.stt.groq]
api_key = "your-groq-key"       # From console.groq.com

Text-to-Speech (TTS)

Providers

Provider	Engine	Cost	Voices	Setup
OpenAI	`gpt-4o-mini-tts`	Per-character pricing	alloy, echo, fable, onyx, nova, shimmer	API key in `keys.toml`
OpenAI-compatible	Any `/v1/audio/speech` endpoint	Varies	Varies by server	`tts_base_url` + `tts_model` + `tts_voice` + API key
Voicebox	Self-hosted async `POST /generate`	Free	Configurable profiles	`voicebox_tts_enabled=true` + `voicebox_tts_base_url` + `voicebox_tts_profile_id`
Local	Piper (on-device)	Free	6 voices	Auto-downloads model

Local TTS Voices (Piper)

Voice	Description
`ryan`	US Male (default)
`amy`	US Female
`lessac`	US Female
`kristin`	US Female
`joe`	US Male
`cori`	UK Female

Models auto-download from HuggingFace to ~/.local/share/opencrabs/models/piper/. A Python venv is created automatically for the Piper runtime.

Configuration

# config.toml
[voice]
tts_enabled = true
tts_mode = "local"              # "api" or "local"
local_tts_voice = "ryan"        # ryan, amy, lessac, kristin, joe, cori

For API mode:

# config.toml
[voice]
tts_mode = "api"
tts_voice = "echo"              # OpenAI voice name
tts_model = "gpt-4o-mini-tts"   # OpenAI model

# keys.toml
[providers.tts.openai]
api_key = "your-openai-key"

Full Configuration Reference

# config.toml
[voice]
# Speech-to-Text
stt_enabled = true
stt_mode = "groq"                 # "groq", "openai_compatible", "voicebox", "local"
local_stt_model = "local-tiny"    # local-tiny, local-base, local-small, local-medium
stt_base_url = "https://..."      # OpenAI-compatible STT endpoint
stt_model = "whisper-1"           # OpenAI-compatible STT model
voicebox_stt_enabled = false
voicebox_stt_base_url = "https://..."

# Text-to-Speech
tts_enabled = true
tts_mode = "openai"               # "openai", "openai_compatible", "voicebox", "local"
tts_voice = "echo"                # OpenAI TTS voice name
tts_model = "gpt-4o-mini-tts"     # OpenAI TTS model
local_tts_voice = "ryan"          # Local mode: Piper voice
tts_base_url = "https://..."      # OpenAI-compatible TTS endpoint
tts_model = "tts-1"               # OpenAI-compatible TTS model
voicebox_tts_enabled = false
voicebox_tts_base_url = "https://..."
voicebox_tts_profile_id = "profile-id"

# keys.toml
[providers.stt.groq]
api_key = "your-groq-key"

[providers.stt.openai_compatible]
api_key = "your-api-key"

[providers.tts.openai]
api_key = "your-openai-key"

[providers.tts.openai_compatible]
api_key = "your-api-key"

How Voice Messages Work

When a voice message arrives on Telegram, WhatsApp, Discord, or Slack:

Audio is decoded (OGG/Opus or WAV)
Transcribed via STT (local whisper.cpp or Groq API)
Agent processes the text and generates a response
Response is converted to speech via TTS (local Piper or OpenAI API)
Audio is encoded as OGG/Opus and sent back as a voice message

Local mode handles everything on-device — no API calls, no cost, no data leaves your machine.

Hardware Requirements

Feature	CPU Requirement	Notes
Local STT (rwhisper)	AVX2 (Haswell 2013+)	Metal GPU on macOS Apple Silicon
Local TTS (Piper)	No restrictions	Tested on 2007 iMac — works on any x86/ARM
Local embeddings	AVX (Sandy Bridge 2011+)	Falls back to FTS-only search

OpenCrabs detects CPU capabilities at runtime and hides unavailable options in the onboarding wizard. Local TTS (Piper) has no CPU limitations and should work on virtually any machine.

Building Without Voice

Voice features are enabled by default. To build without them (smaller binary):

cargo build --release --no-default-features --features telegram,whatsapp,discord,slack,trello

Feature flags: local-stt (whisper.cpp), local-tts (Piper).

Keyboard shortcuts

OpenCrabs