Hands-free with voice

Drive agents by talking — pick the right dictation engine for your privacy and accuracy needs, use chat vs command mode well, and build a workflow where you prompt, approve, and steer without a keyboard.

The fastest prompt is the one you speak. On a phone, dictation isn't a fallback for when typing is inconvenient — it's the primary way to drive an agent when you're walking, cooking, or just thinking out loud. A spoken "add a nullable deleted_at column to users, write the migration, then show me the diff" beats thumb-typing it every time. This guide is about making voice a real, reliable part of your workflow — not a gimmick you try once.

Voice pairs naturally with everything else in Moshi: you dictate the prompt, the agent runs inside tmux, and approvals come back to your lock screen. The keyboard barely enters the loop.

What you'll learn

Choose between on-device, local, and cloud dictation engines
Use chat mode and command mode for the jobs each is best at
Dictate punctuation, code-ish terms, and edits cleanly
Combine voice with the toolbar and image paste
Build a genuinely hands-free prompt → approve → review loop

Pick an engine

Moshi offers four dictation backends. They trade off privacy, accuracy, and connectivity differently — choose by what matters to you.

| Engine | Runs | Best for | Network |
|---|---|---|---|
| Parakeet | On device | Speed + accuracy for English and European languages | Offline |
| Apple SpeechAnalyzer | On device | Zero setup on iOS 26+ | Offline |
| Local Whisper | On device | Broadest language coverage | Offline |
| Cloud | Server | Max accuracy, no model download | Online |

Parakeet is the recommended default for English and many European languages — a fast on-device model with no quota that never sends audio off the phone.
Apple SpeechAnalyzer (iOS 26+) is the zero-setup option: nothing to download, fully on-device.
Local Whisper has the broadest language coverage and works on every supported iOS version, with a model you download once.
Cloud typically gives the best accuracy with no model to download, at the cost of sending audio to a server and a metered quota.

Set your engine in Moshi's Settings; the full matrix and language notes are in Voice and dictation.

Talk, don't thumb-type. Dictate a full prompt into chat mode, glance, fix a word, send; hands back in your pockets.

If you dictate sensitive content, prefer an on-device engine (Parakeet, SpeechAnalyzer, or local Whisper) so audio never leaves your phone. Choosing Cloud is a deliberate trade of privacy for accuracy.
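
The trade-offs above can be read as a small decision procedure. The sketch below is only an illustration of that reasoning: the `Needs` fields and `pick_engine` function are hypothetical names, not part of Moshi, and the ordering of the checks is an assumption about which priority wins when several apply.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Needs:
    """Hypothetical description of what matters to the user."""
    sensitive: bool = False            # audio must never leave the phone
    offline: bool = False              # no network available
    max_accuracy: bool = False         # accuracy matters more than privacy
    zero_setup: bool = False           # prefer an engine with nothing to configure
    english_or_european: bool = True   # language is English or a European language
    ios_major: int = 26                # iOS major version on the device


def pick_engine(needs: Needs) -> str:
    """Return the engine name from the table that best fits the stated needs."""
    on_device_only = needs.sensitive or needs.offline
    # Cloud is the only engine that needs the network and sends audio to a server.
    if needs.max_accuracy and not on_device_only:
        return "Cloud"
    # SpeechAnalyzer is the zero-setup option, but only on iOS 26 and later.
    if needs.zero_setup and needs.ios_major >= 26:
        return "Apple SpeechAnalyzer"
    # Parakeet is the recommended default for English and many European languages.
    if needs.english_or_european:
        return "Parakeet"
    # Local Whisper has the broadest language coverage.
    return "Local Whisper"
```

For example, `pick_engine(Needs(max_accuracy=True, sensitive=True))` stays on device and returns `"Parakeet"`, because the privacy requirement rules out Cloud before accuracy is considered.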

Two modes, two jobs

Moshi separates composing from executing, and you'll switch between them constantly.

Chat mode — compose, then send

Chat mode turns speech into editable text you review before sending. Use it for anything with detail: multi-step instructions, anything where a mis-transcribed word would matter, or prompts you want to tweak before the agent acts.

"Refactor the payment service to use the new client, keep the public interface unchanged, and add a test for the retry path." → appears as text → you fix one word → send.

Command mode — straight to the shell

Command mode types your words directly into the terminal. Use it for short, low-risk commands where round-tripping through an edit box is just friction:

"git status" · "run the tests" · "clear"

The rule of thumb: chat mode for prompts to the agent, command mode for commands to the shell. When in doubt, chat mode — the edit step is cheap insurance.
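
As a rough illustration of that rule of thumb, the sketch below routes an utterance to a mode. It is not how Moshi decides anything (you pick the mode yourself); `LOW_RISK_COMMANDS` and `pick_mode` are hypothetical names, and the allowlist simply reuses the three examples above.

```python
# Hypothetical allowlist of short, low-risk utterances, taken from the examples above.
LOW_RISK_COMMANDS = {"git status", "run the tests", "clear"}


def pick_mode(utterance: str) -> str:
    """Return "command" for known low-risk commands, otherwise "chat"."""
    # Normalize case and collapse whitespace so "  Git   Status " still matches.
    normalized = " ".join(utterance.lower().split())
    if normalized in LOW_RISK_COMMANDS:
        return "command"
    # When in doubt, chat mode: the edit step is cheap insurance.
    return "chat"
```

The default branch is the important design choice: anything not explicitly known to be short and safe goes through the editable path.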

Dictating code-shaped language

You're not dictating prose; you're dictating instructions full of symbols and identifiers. A few habits help:

Speak punctuation by name where it matters — "open paren", "dash", "underscore" — or dictate naturally and fix symbols in chat mode before sending.
Prefer describing intent over spelling exact tokens: "the deleted-at column" lets the agent map it to deleted_at, instead of fighting the transcriber over an underscore.
For non-Latin input, see CJK input — dictation and IME concerns overlap there.
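
To make the "speak punctuation by name" habit concrete, here is a minimal sketch of turning spoken symbol names into characters. It is an illustration only and says nothing about how any Moshi engine handles punctuation; `SPOKEN_SYMBOLS` and `apply_spoken_symbols` are hypothetical, and "close paren" is an assumed counterpart to the "open paren" example.

```python
import re

# Hypothetical mapping from spoken names to symbols.
SPOKEN_SYMBOLS = {
    "open paren": "(",
    "close paren": ")",
    "underscore": "_",
    "dash": "-",
}

# Longest names first so multi-word names are tried before shorter ones.
_NAMES = sorted(SPOKEN_SYMBOLS, key=len, reverse=True)
_PATTERN = re.compile(
    r"\s*\b(" + "|".join(re.escape(name) for name in _NAMES) + r")\b\s*",
    re.IGNORECASE,
)


def apply_spoken_symbols(text: str) -> str:
    """Replace spoken symbol names with symbols, gluing them to their neighbors."""
    # Surrounding whitespace is swallowed, which suits identifiers and calls
    # ("deleted underscore at") but not space-separated flags.
    return _PATTERN.sub(lambda m: SPOKEN_SYMBOLS[m.group(1).lower()], text).strip()
```

Because whole-word matching is used, ordinary words that merely contain a symbol name, such as "dashboard", are left alone.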

When a term simply won't transcribe, fall back to the keyboard toolbar for the exact characters and let voice handle the rest of the sentence. Voice and typing aren't either/or.

Pair voice with the rest of the loop

Voice is one input among several, and it works best in combination with the others:

Voice + image paste. Paste a screenshot of a broken layout, then dictate "fix this so the header doesn't overlap on mobile." The agent gets both the picture and the instruction. See Image paste.
Voice + approvals. Dictate the prompt, lock the phone, and answer the resulting permission request from your lock screen. No keyboard in the whole cycle. See Live Activity and Notifications.
Voice + toolbar. Dictate the body of a command and use the toolbar for the flags and symbols.

A hands-free session, start to finish

Here's the loop voice makes possible, with your hands free the whole time:

Open the project session (it's already running inside tmux on the host).
Chat mode: dictate the task. Glance, fix a word, send.
Lock the phone and keep walking.
A notification: the agent wants to run a command. Read it, tap Allow.
Another notification: turn complete. Open the diff viewer to read what changed.
Chat mode: dictate the follow-up — "good, now add a test for the empty case."

At no point did you type. That's the bar voice can clear once you've picked an engine and internalized the two modes.

Where to go next

Moshi with Claude Code — the loop voice plugs into
Voice and dictation — engines and settings in depth
Keyboard and Image paste — the other fast inputs
CJK input — non-Latin dictation and IME
