Hands-free with voice
Drive agents by talking — pick the right dictation engine for your privacy and accuracy needs, use chat vs command mode well, and build a workflow where you prompt, approve, and steer without a keyboard.
The fastest prompt is the one you speak. On a phone, dictation isn't a fallback for when typing is inconvenient — it's the primary way to drive an agent when you're walking, cooking, or just thinking out loud. A spoken "add a nullable deleted_at column to users, write the migration, then show me the diff" beats thumb-typing it every time. This guide is about making voice a real, reliable part of your workflow — not a gimmick you try once.
Voice pairs naturally with everything else in Moshi: you dictate the prompt, the agent runs inside tmux, and approvals come back to your lock screen. The keyboard barely enters the loop.
What you'll learn
- Choose between Apple's built-in on-device engine, local Whisper, and hosted cloud dictation
- Use chat mode and command mode for the jobs each is best at
- Dictate punctuation, code-ish terms, and edits cleanly
- Combine voice with the toolbar and image paste
- Build a genuinely hands-free prompt → approve → review loop
Pick an engine
Moshi offers three dictation backends. They trade off privacy, accuracy, and connectivity differently — choose by what matters to you.
- Apple SpeechAnalyzer is the no-setup default: fully on-device, private, and works in airplane mode. Great for everyday prompting.
- Local Whisper runs a model on the device for stronger handling of jargon and code-adjacent words, still without sending audio anywhere.
- Hosted cloud offers the highest accuracy and helps on older devices, at the cost of sending audio to a server and needing a connection.
Set your engine in Moshi's Settings; the full matrix and language notes are in Voice and dictation.
If you dictate sensitive content, prefer an on-device engine (SpeechAnalyzer or local Whisper) so audio never leaves your phone. Cloud is a deliberate trade of privacy for accuracy.
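The trade-offs above can be read as a small decision rule. The sketch below is illustrative only: the `Needs` fields, the `pick_engine` function, and the engine labels are hypothetical names invented for this example, not Moshi settings or an API. It also assumes that when privacy and connectivity are not constraints, jargon-heavy dictation goes to the cloud engine, which is a judgment call you may make differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Needs:
    """Hypothetical description of what matters for a dictation session."""
    sensitive: bool      # audio must never leave the phone
    offline: bool        # no reliable connection (airplane mode, poor signal)
    heavy_jargon: bool   # lots of technical or code-adjacent terms


def pick_engine(needs: Needs) -> str:
    """Return an illustrative engine label based on the trade-offs in this guide."""
    # Sensitive or offline dictation rules out the cloud engine entirely.
    if needs.sensitive or needs.offline:
        return "local-whisper" if needs.heavy_jargon else "speechanalyzer"
    # Assumption: with no privacy or connectivity constraint, trade privacy for accuracy.
    if needs.heavy_jargon:
        return "cloud"
    # The no-setup default covers everyday prompting.
    return "speechanalyzer"


print(pick_engine(Needs(sensitive=True, offline=False, heavy_jargon=True)))
```

Putting the privacy and connectivity check first mirrors the advice above: accuracy only becomes the deciding factor once you have settled whether audio may leave the phone.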
Two modes, two jobs
Moshi separates composing from executing, and you'll switch between them constantly.
Chat mode — compose, then send
Chat mode turns speech into editable text you review before sending. Use it for anything with detail: multi-step instructions, anything where a mis-transcribed word would matter, or prompts you want to tweak before the agent acts.
"Refactor the payment service to use the new client, keep the public interface unchanged, and add a test for the retry path." → appears as text → you fix one word → send.
Command mode — straight to the shell
Command mode types your words directly into the terminal. Use it for short, low-risk commands where round-tripping through an edit box is just friction:
"git status" · "run the tests" · "clear"
The rule of thumb: chat mode for prompts to the agent, command mode for commands to the shell. When in doubt, chat mode — the edit step is cheap insurance.
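As a rough sketch of that rule of thumb, the snippet below routes an utterance to command mode only when it exactly matches a short allow-list, and to chat mode otherwise. The `SAFE_COMMANDS` set and the `pick_mode` function are hypothetical; the entries are just the examples from this guide, and Moshi does not necessarily choose modes this way (you switch modes yourself).

```python
# Hypothetical allow-list: the short, low-risk examples used in this guide.
SAFE_COMMANDS = {"git status", "run the tests", "clear"}


def pick_mode(utterance: str) -> str:
    """Return "command" for known short commands, otherwise "chat"."""
    # Normalize case and whitespace so "Git  status" matches "git status".
    normalized = " ".join(utterance.lower().split())
    # When in doubt, fall through to chat mode and its edit-before-send step.
    return "command" if normalized in SAFE_COMMANDS else "chat"


print(pick_mode("git status"))
print(pick_mode("Refactor the payment service to use the new client"))
```

Defaulting to chat mode encodes the "cheap insurance" idea: anything not explicitly known to be short and safe gets a review step.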
Dictating code-shaped language
You're not dictating prose; you're dictating instructions full of symbols and identifiers. A few habits help:
- Speak punctuation by name where it matters — "open paren", "dash", "underscore" — or dictate naturally and fix symbols in chat mode before sending.
- Prefer describing intent over spelling exact tokens: "the deleted-at column" lets the agent map it to deleted_at, instead of fighting the transcriber over an underscore.
- For non-Latin input, see CJK input; dictation and IME concerns overlap there.
When a term simply won't transcribe, fall back to the keyboard toolbar for the exact characters and let voice handle the rest of the sentence. Voice and typing aren't either/or.
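To make the spoken-punctuation habit concrete, here is a minimal sketch of how names like "open paren" and "underscore" could be turned into symbols after transcription. The `SPOKEN_SYMBOLS` table and the `apply_spoken_symbols` function are assumptions made for illustration; the phrases a real dictation engine recognizes, and how it handles them, may differ.

```python
import re

# Hypothetical spoken-name-to-symbol table; a real engine's vocabulary may differ.
SPOKEN_SYMBOLS = {
    "open paren": "(",
    "close paren": ")",
    "underscore": "_",
    "dash": "-",
}


def apply_spoken_symbols(text: str) -> str:
    """Replace spoken symbol names with the symbols themselves."""
    # Handle longer phrases first so multi-word names are matched as a unit.
    for phrase in sorted(SPOKEN_SYMBOLS, key=len, reverse=True):
        symbol = SPOKEN_SYMBOLS[phrase]
        # Swallow the spaces around the phrase so the symbol joins its neighbors.
        pattern = rf"\s*\b{re.escape(phrase)}\b\s*"
        text = re.sub(pattern, symbol, text, flags=re.IGNORECASE)
    return text


print(apply_spoken_symbols("deleted underscore at"))
```

Because this sketch glues every symbol to its neighbors, it suits identifiers and call syntax but would also remove the space before a flag such as "dash v", which is one more reason to review symbols in chat mode before sending.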
Pair voice with the rest of the loop
Voice is one input among several — the magic is combining them:
- Voice + image paste. Paste a screenshot of a broken layout, then dictate "fix this so the header doesn't overlap on mobile." The agent gets both the picture and the instruction. See Image paste.
- Voice + approvals. Dictate the prompt, lock the phone, and answer the resulting permission request from your lock screen. No keyboard in the whole cycle. See Live Activity and Notifications.
- Voice + toolbar. Dictate the body of a command and use the toolbar for the flags and symbols.
A hands-free session, start to finish
Here's the loop voice makes possible, with your hands off the keyboard the whole time:
- Open the project session (it's already running inside tmux on the host).
- Chat mode: dictate the task. Glance, fix a word, send.
- Lock the phone and keep walking.
- A notification: the agent wants to run a command. Read it, tap Allow.
- Another notification: turn complete. Open the diff viewer to read what changed.
- Chat mode: dictate the follow-up — "good, now add a test for the empty case."
At no point did you type. That's the bar voice can clear once you've picked an engine and internalized the two modes.
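One way to picture the session above is as a small state machine that cycles between dictating, waiting, approving, and reviewing. This is only a mental model: the `Step` names, the event strings, and the `run` helper are hypothetical labels for the notifications and actions described in the steps, not anything Moshi exposes.

```python
from enum import Enum, auto


class Step(Enum):
    DICTATE = auto()  # chat mode: speak, glance, send
    WAIT = auto()     # phone locked, agent working
    APPROVE = auto()  # permission request on the lock screen
    REVIEW = auto()   # turn complete: read the diff


# Hypothetical event names; they only label the actions described in the steps.
TRANSITIONS = {
    (Step.DICTATE, "sent"): Step.WAIT,
    (Step.WAIT, "permission_request"): Step.APPROVE,
    (Step.APPROVE, "allowed"): Step.WAIT,
    (Step.WAIT, "turn_complete"): Step.REVIEW,
    (Step.REVIEW, "follow_up"): Step.DICTATE,
}


def run(events, start=Step.DICTATE):
    """Walk the loop and return the step you end up in."""
    step = start
    for event in events:
        step = TRANSITIONS[(step, event)]
    return step


session = ["sent", "permission_request", "allowed", "turn_complete", "follow_up"]
print(run(session).name)
```

None of the transitions involves a typing state, which is the point of the loop: every hand-off is a spoken prompt, a tap, or a glance.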
Troubleshooting
Transcription is inaccurate on technical terms
Try local Whisper or the hosted cloud engine — both handle jargon better than the lightweight on-device default. Or describe intent instead of dictating exact identifiers and let the agent resolve them.
Dictation doesn't work offline
You're on the cloud engine, which needs a connection. Switch to Apple SpeechAnalyzer or local Whisper in Settings for offline use.
Wrong words keep slipping through
Use chat mode, not command mode, for anything that matters — the edit-before-send step exists precisely to catch this. Reserve command mode for short, safe commands.
Where to go next
- Moshi with Claude Code — the loop voice plugs into
- Voice and dictation — engines and settings in depth
- Keyboard and Image paste — the other fast inputs
- CJK input — non-Latin dictation and IME