Voice and dictation

Choose a speech engine, download Whisper models, set languages, use chat mode, and review transcription history.

Updated 1 week ago · 10 min read · Page 9 / 23

Moshi can turn speech into terminal input. Dictation is most useful for natural-language agent prompts, short shell commands, and editing text without fighting the iOS keyboard.

Speech engines

Moshi has three speech engines. Pick one in Settings -> Speech.

  • Apple uses the on-device SpeechAnalyzer framework on iOS 26+. Nothing leaves the phone, there is no model to download, and it is the fastest option on supported devices. Older iOS versions hide this option.
  • Whisper runs whisper.cpp locally with a model you download. It works on every supported iOS version and is the right choice when Apple's engine is unavailable, when you need the same engine across devices, or when you want full offline use.
  • Cloud sends audio to Moshi's hosted transcription. It typically gives the best accuracy with no model to download, but it requires a registered push token and is metered: free accounts get a small daily quota and Pro accounts get a larger one. The dictation settings screen shows remaining and total quota.
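The trade-offs above amount to a simple fallback chain. A minimal sketch of one plausible ordering, with hypothetical names (none of these come from Moshi itself):

```python
IOS_SPEECHANALYZER_MIN = 26  # Apple engine requires iOS 26+

def pick_engine(ios_version: int, whisper_model_downloaded: bool,
                cloud_quota_remaining: int) -> str:
    """Pick a speech engine following the trade-offs described above."""
    if ios_version >= IOS_SPEECHANALYZER_MIN:
        return "apple"    # on-device, fastest, nothing leaves the phone
    if whisper_model_downloaded:
        return "whisper"  # local, works on every supported iOS version
    if cloud_quota_remaining > 0:
        return "cloud"    # best accuracy, but metered and needs a push token
    return "whisper"      # last resort: download a model for offline use

print(pick_engine(18, False, 40))  # older iOS, no local model yet -> "cloud"
```

The ordering here is only illustrative; in the app you pick the engine explicitly in Settings -> Speech.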

For Whisper, model choices include small English-only models and larger multilingual models. Larger models use more storage and may be slower, but can improve quality. Downloaded models stay on the device and can be removed later to reclaim storage.
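The storage/quality trade-off can be framed as "the largest model that fits your budget". A sketch assuming Moshi's models come from the standard Whisper ggml family; the sizes below are rounded approximations, not figures from Moshi:

```python
# Approximate sizes for common Whisper ggml models (assumption: rounded values).
MODELS = {
    "tiny.en": {"mb": 75,   "multilingual": False},
    "base.en": {"mb": 142,  "multilingual": False},
    "small":   {"mb": 466,  "multilingual": True},
    "medium":  {"mb": 1500, "multilingual": True},
}

def choose_model(storage_budget_mb: int, need_multilingual: bool) -> str:
    """Largest model within budget; bigger usually means better quality."""
    candidates = [
        (info["mb"], name) for name, info in MODELS.items()
        if info["mb"] <= storage_budget_mb
        and (info["multilingual"] or not need_multilingual)
    ]
    if not candidates:
        raise ValueError("no model fits the storage budget")
    return max(candidates)[1]
```

For example, a 500 MB budget with multilingual dictation would land on `small`, while an English-only user with 200 MB to spare would get `base.en`.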

Language

Language can be automatic or pinned to a specific language. Use automatic when you switch languages often; use a fixed language when the engine keeps guessing wrong. Apple and Cloud always expose a language picker; Whisper exposes one only when a multilingual model is selected (English-only models always transcribe English).
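The picker rule above reduces to a small predicate. A sketch with illustrative names (not Moshi's internals):

```python
def shows_language_picker(engine: str,
                          whisper_model_multilingual: bool = False) -> bool:
    """Apple and Cloud always offer a picker; Whisper only when the
    selected model is multilingual (English-only models always
    transcribe English)."""
    if engine in ("apple", "cloud"):
        return True
    if engine == "whisper":
        return whisper_model_multilingual
    raise ValueError(f"unknown engine: {engine}")
```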

Chat mode

Chat mode changes what dictation actually does.

  • Off: dictation streams transcribed text straight into the terminal as keystrokes, which suits shell commands, REPL input, and terse fragments. With Auto-send on, the result also presses Enter.
  • On: dictation opens a composer panel above the terminal. Voice, typed text, and image attachments are drafted together and sent as one message when you tap send. This is the right mode for natural-language agent prompts (Claude Code, Codex, OpenCode, Gemini, Cursor, Kimi, Qwen) where you want to review or edit before the agent sees it.

Chat mode also lets you keep dictating while you fix a typo, paste an image, or add a follow-up sentence — none of it reaches the remote shell until you send.
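The difference between the two modes, including Auto-send, can be sketched as a routing decision. This is an illustrative model, not Moshi's real implementation; `terminal` stands in for anything with a `write(str)` method:

```python
class DictationSink:
    """Where transcribed text goes, depending on chat mode and Auto-send."""

    def __init__(self, terminal, chat_mode: bool, auto_send: bool):
        self.terminal = terminal
        self.chat_mode = chat_mode
        self.auto_send = auto_send
        self.draft = ""  # composer contents; only used when chat mode is on

    def on_transcript(self, text: str) -> None:
        if self.chat_mode:
            self.draft += text            # buffer: nothing reaches the shell yet
        else:
            self.terminal.write(text)     # stream straight in as keystrokes
            if self.auto_send:
                self.terminal.write("\n") # Auto-send also presses Enter

    def send(self) -> None:
        """Tap send in the composer: the whole draft goes out as one message."""
        if self.chat_mode and self.draft:
            self.terminal.write(self.draft + "\n")
            self.draft = ""
```

With chat mode off, every final transcript reaches the shell immediately; with it on, everything accumulates in `draft` until `send()`.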

Turn chat mode off if your work is mostly shell commands and tmux navigation; turn it on if you mostly talk to an agent.

Images in prompts

When chat mode is on, you can attach an image to the prompt you are dictating. Pick a photo, paste an image already on the iOS clipboard, or annotate it in the built-in image editor first. Moshi sends a short URL inline with your text so the agent can fetch it — without writing the image to the host. See Image paste for the full flow.

Auto-send

Auto-send submits the transcription after dictation finishes. Leave it off if you prefer to review text before it reaches the terminal.

Transcription history

Moshi keeps a transcription history so you can reuse recent dictation. This is useful when a long prompt needs a small correction or when you want to send a similar instruction to another session.
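Conceptually the history behaves like a bounded most-recent-first list: recall an entry, optionally tweak it, and resend. A sketch under that assumption (the limit and the API are invented for illustration):

```python
from collections import deque

class TranscriptionHistory:
    """Keep the N most recent transcriptions for reuse (illustrative only)."""

    def __init__(self, limit: int = 50):
        self.items = deque(maxlen=limit)  # oldest entries fall off the end

    def add(self, text: str) -> None:
        self.items.appendleft(text)       # newest first

    def recent(self, n: int = 5) -> list:
        return list(self.items)[:n]

    def reuse(self, index: int, edit=None) -> str:
        """Recall an entry, optionally transforming it before resending."""
        text = self.items[index]
        return edit(text) if edit else text
```

This is the "long prompt needs a small correction" case: `reuse(0, edit=...)` recalls the latest dictation with a tweak instead of redictating the whole prompt.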

Practical tips

  • Use push-to-talk for short prompts.
  • Say punctuation explicitly when writing code-like text.
  • Review destructive commands before sending.
  • Download the model you intend to use before relying on dictation away from a fast network.