Media understanding
OmeniaClaw can summarize inbound media (image/audio/video) before the reply pipeline runs. It auto-detects when local tools or provider keys are available, and can be disabled or customized. If understanding is off, models still receive the original files/URLs as usual.
Vendor-specific media behavior is registered by vendor plugins, while OmeniaClaw core owns the shared tools.media config, fallback order, and reply-pipeline integration.
Goals
- Optional: pre-digest inbound media into short text for faster routing + better command parsing.
- Preserve original media delivery to the model (always).
- Support provider APIs and CLI fallbacks.
- Allow multiple models with ordered fallback (error/size/timeout).
High-level behavior
Collect attachments
Collect inbound attachments (MediaPaths, MediaUrls, MediaTypes).
Select per-capability
For each enabled capability (image/audio/video), select attachments per policy (default: first).
Choose model
Choose the first eligible model entry (size + capability + auth).
Fallback on failure
If a model fails or the media is too large, fall back to the next entry.
Apply success block
On success:
- Body becomes an [Image], [Audio], or [Video] block.
- Audio sets {{Transcript}}; command parsing uses caption text when present, otherwise the transcript.
- Captions are preserved as User text: inside the block.
If understanding fails or is disabled, the reply flow continues with the original body + attachments.
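The selection and fallback steps above can be sketched in Python. This is an illustrative model under stated assumptions, not OmeniaClaw's implementation: `ModelEntry`, `run`, and `summarize` are hypothetical names, and the auth check from the eligibility step is left out for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class ModelEntry:
    """Hypothetical stand-in for one models[] entry."""
    name: str
    capabilities: Sequence[str]
    max_bytes: int
    run: Callable[[bytes], str]  # raises on provider/CLI failure or timeout


def summarize(media: bytes, capability: str,
              entries: Sequence[ModelEntry]) -> Optional[str]:
    """Return the first successful summary, or None if every entry fails."""
    for entry in entries:
        if capability not in entry.capabilities:
            continue  # entry does not handle this media type
        if len(media) > entry.max_bytes:
            continue  # too large for this entry: try the next one
        try:
            return entry.run(media)
        except Exception:
            continue  # error or timeout: fall back to the next entry
    return None  # caller continues with the original body + attachments


def _fail(_: bytes) -> str:
    raise TimeoutError("simulated timeout")


entries = [
    ModelEntry("small", ["image"], max_bytes=4, run=lambda m: "from small"),
    ModelEntry("flaky", ["image"], max_bytes=100, run=_fail),
    ModelEntry("backup", ["image"], max_bytes=100, run=lambda m: "from backup"),
]
```

With these sample entries, an 8-byte image skips `small` (too large), falls through `flaky` (simulated timeout), and is handled by `backup`.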
Config overview
tools.media supports shared models plus per-capability overrides:
Top-level keys
- tools.media.models: shared model list (use capabilities to gate).
- tools.media.image / tools.media.audio / tools.media.video:
  - defaults (prompt, maxChars, maxBytes, timeoutSeconds, language)
  - provider overrides (baseUrl, headers, providerOptions)
  - Deepgram audio options via tools.media.audio.providerOptions.deepgram
  - audio transcript echo controls (echoTranscript, default false; echoFormat)
  - optional per-capability models list (preferred before shared models)
  - attachments policy (mode, maxAttachments, prefer)
  - scope (optional gating by channel/chatType/session key)
- tools.media.concurrency: max concurrent capability runs (default 2).
{
  tools: {
    media: {
      models: [
        /* shared list */
      ],
      image: {
        /* optional overrides */
      },
      audio: {
        /* optional overrides */
        echoTranscript: true,
        echoFormat: '📝 "{transcript}"',
      },
      video: {
        /* optional overrides */
      },
    },
  },
}
Model entries
Each models[] entry can be provider or CLI:
Provider entry
{
  type: "provider", // default if omitted
  provider: "openai",
  model: "gpt-5.5",
  prompt: "Describe the image in <= 500 chars.",
  maxChars: 500,
  maxBytes: 10485760,
  timeoutSeconds: 60,
  capabilities: ["image"], // optional, used for multi-modal entries
  profile: "vision-profile",
  preferredProfile: "vision-fallback",
}
CLI entry
{
  type: "cli",
  command: "gemini",
  args: [
    "-m",
    "gemini-3-flash",
    "--allowed-tools",
    "read_file",
    "Read the media at {{MediaPath}} and describe it in <= {{MaxChars}} characters.",
  ],
  maxChars: 500,
  maxBytes: 52428800,
  timeoutSeconds: 120,
  capabilities: ["video", "image"],
}
CLI templates can also use:
- {{MediaDir}} (directory containing the media file)
- {{OutputDir}} (scratch dir created for this run)
- {{OutputBase}} (scratch file base path, no extension)
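As a rough illustration of how the template placeholders in CLI args could be expanded, here is a minimal Python sketch. The function name `expand_args`, the sample paths, and the choice to leave unknown placeholders untouched are assumptions made for this example; the actual substitution rules may differ.

```python
import re
from typing import List, Mapping, Sequence

# Matches {{Name}} style placeholders such as {{MediaPath}}.
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def expand_args(args: Sequence[str], values: Mapping[str, str]) -> List[str]:
    """Replace {{Name}} in each argument; unknown names are left as-is (assumption)."""
    def substitute(match: "re.Match[str]") -> str:
        return values.get(match.group(1), match.group(0))

    return [PLACEHOLDER.sub(substitute, arg) for arg in args]


# Sample values only; real paths come from the inbound media and the run's scratch dir.
values = {
    "MediaPath": "/tmp/in/clip.mp4",
    "MediaDir": "/tmp/in",
    "OutputDir": "/tmp/out",
    "OutputBase": "/tmp/out/clip",
    "MaxChars": "500",
}

args = expand_args(
    ["--output-dir", "{{OutputDir}}",
     "Describe {{MediaPath}} in <= {{MaxChars}} characters."],
    values,
)
```

Each argument is expanded separately, so a path containing spaces stays inside a single argv element rather than being split by a shell.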
Provider credentials (apiKey)
Provider media understanding uses the same provider auth resolution as normal
model calls: auth profiles, environment variables, then
models.providers.<providerId>.apiKey.
tools.media.*.models[] entries do not accept an inline apiKey field. The
provider value in a media model entry, such as openai or moonshot, must
have credentials available through one of the standard provider auth sources.
Minimal example:
{
  models: {
    providers: {
      openai: { apiKey: "<OPENAI_API_KEY>" },
      moonshot: { apiKey: "<MOONSHOT_API_KEY>" },
    },
  },
}
For the full provider auth reference, including profiles, environment variables, and custom base URLs, see Tools and custom providers.
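The resolution order described in this section (auth profiles, then environment variables, then models.providers.<providerId>.apiKey) can be pictured with a small Python sketch. Everything here is a simplification for illustration: auth profiles are modeled as a plain provider-to-key mapping, and the `<PROVIDER>_API_KEY` environment variable naming is an assumption, not a documented rule.

```python
from typing import Any, Mapping, Optional


def resolve_api_key(provider: str,
                    auth_profiles: Mapping[str, str],
                    env: Mapping[str, str],
                    config: Mapping[str, Any]) -> Optional[str]:
    """Check auth profiles, then environment variables, then config."""
    if provider in auth_profiles:
        return auth_profiles[provider]
    env_name = f"{provider.upper()}_API_KEY"  # assumed naming convention
    if env.get(env_name):
        return env[env_name]
    providers = config.get("models", {}).get("providers", {})
    return providers.get(provider, {}).get("apiKey")


# Hypothetical config mirroring the models.providers shape shown above.
config = {"models": {"providers": {"moonshot": {"apiKey": "cfg-key"}}}}
```

A provider with no match in any of the three sources resolves to `None`, which corresponds to a media model entry that is not eligible because it lacks credentials.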
Defaults and limits
Recommended defaults:
- maxChars: 500 for image/video (short, command-friendly)
- maxChars: unset for audio (full transcript unless you set a limit)
- maxBytes:
  - image: 10MB
  - audio: 20MB
  - video: 50MB
Rules
- If media exceeds maxBytes, that model is skipped and the next model is tried.
- Audio files smaller than 1024 bytes are treated as empty/corrupt and skipped before provider/CLI transcription; inbound reply context receives a deterministic placeholder transcript so the agent knows the note was too small.
- If the model returns more than maxChars, output is trimmed.
- prompt defaults to a simple "Describe the {media}." plus the maxChars guidance (image/video only).
- If the active primary image model already supports vision natively, OmeniaClaw skips the [Image] summary block and passes the original image into the model instead.
- If a Gateway/WebChat primary model is text-only, image attachments are preserved as offloaded media://inbound/* refs so the image/PDF tools or configured image model can still inspect them instead of losing the attachment.
- Explicit OmeniaClaw infer image describe --model <provider/model> requests are different: they run that image-capable provider/model directly, including Ollama refs such as ollama/qwen2.5vl:7b.
- If <capability>.enabled: true but no models are configured, OmeniaClaw tries the active reply model when its provider supports the capability.
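Two of the rules above, the 1024-byte minimum for audio and the maxChars trim, are simple enough to sketch in Python. The helper names are hypothetical, the placeholder transcript is not modeled, and a plain hard cut at maxChars is an assumption; the real trimming may handle word boundaries differently.

```python
from typing import Optional

MIN_AUDIO_BYTES = 1024  # from the rule above: smaller audio counts as empty/corrupt


def is_too_small_audio(capability: str, size_bytes: int) -> bool:
    """True when an audio file should be skipped before transcription."""
    return capability == "audio" and size_bytes < MIN_AUDIO_BYTES


def trim_output(text: str, max_chars: Optional[int]) -> str:
    """Trim model output to max_chars; None means no limit (the audio default)."""
    if max_chars is None or len(text) <= max_chars:
        return text
    return text[:max_chars]  # hard cut (assumption)
```

For example, `trim_output(summary, 500)` caps an image summary at 500 characters, while `trim_output(transcript, None)` returns a full audio transcript unchanged.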
Auto-detect media understanding (default)
If tools.media.<capability>.enabled is not set to false and you haven't configured models, OmeniaClaw auto-detects in this order and stops at the first working option:
Active reply model
Active reply model when its provider supports the capability.
agents.defaults.imageModel
agents.defaults.imageModel primary/fallback refs (image only).
Prefer provider/model refs. Bare refs are qualified from configured image-capable provider model entries only when the match is unique.
Local CLIs (audio only)
Local CLIs (if installed):
- sherpa-onnx-offline (requires SHERPA_ONNX_MODEL_DIR with encoder/decoder/joiner/tokens)
- whisper-cli (whisper-cpp; uses WHISPER_CPP_MODEL or the bundled tiny model)
- whisper (Python CLI; downloads models automatically)
Gemini CLI
gemini using read_many_files.
Provider auth
- Configured models.providers.* entries that support the capability are tried before the bundled fallback order.
- Image-only config providers with an image-capable model auto-register for media understanding even when they are not a bundled vendor plugin.
- Ollama image understanding is available when selected explicitly, for example through agents.defaults.imageModel or OmeniaClaw infer image describe --model ollama/<vision-model>.
Bundled fallback order:
- Audio: OpenAI → Groq → xAI → Deepgram → OpenRouter → Google → SenseAudio → ElevenLabs → Mistral
- Image: OpenAI → Anthropic → Google → MiniMax → MiniMax Portal → Z.AI
- Video: Google → Qwen → Moonshot
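To make the bundled fallback order concrete, the following Python sketch walks the order for one capability and returns the first provider that has credentials. The lowercase provider ids for SenseAudio and ElevenLabs, the `has_credentials` callback, and the function name are assumptions for illustration; configured models.providers.* entries, which are tried first, are not modeled.

```python
from typing import Callable, Dict, List, Optional

# Order copied from the bundled fallback lists above; ids are lowercase guesses.
BUNDLED_ORDER: Dict[str, List[str]] = {
    "audio": ["openai", "groq", "xai", "deepgram", "openrouter",
              "google", "senseaudio", "elevenlabs", "mistral"],
    "image": ["openai", "anthropic", "google", "minimax", "minimax-portal", "zai"],
    "video": ["google", "qwen", "moonshot"],
}


def first_available(capability: str,
                    has_credentials: Callable[[str], bool]) -> Optional[str]:
    """Return the first bundled provider with credentials, or None."""
    for provider in BUNDLED_ORDER.get(capability, []):
        if has_credentials(provider):
            return provider
    return None


# Example: only Google and Mistral keys are configured.
configured = {"google", "mistral"}
audio_choice = first_available("audio", lambda p: p in configured)
```

In this example Google is chosen for audio because it appears before Mistral in the audio order.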
To disable auto-detection, set:
{
  tools: {
    media: {
      audio: {
        enabled: false,
      },
    },
  },
}
Proxy environment support (provider models)
When provider-based audio and video media understanding is enabled, OmeniaClaw honors standard outbound proxy environment variables for provider HTTP calls:
- HTTPS_PROXY
- HTTP_PROXY
- ALL_PROXY
- https_proxy
- http_proxy
- all_proxy
If no proxy env vars are set, media understanding uses direct egress. If the proxy value is malformed, OmeniaClaw logs a warning and falls back to direct fetch.
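The proxy behavior above (use a proxy when one is set, warn and go direct when the value is malformed) might look roughly like this in Python. The precedence among the six variables and the definition of "malformed" (no scheme or no hostname) are assumptions of this sketch; the source does not specify either.

```python
import logging
from typing import Mapping, Optional
from urllib.parse import urlsplit

log = logging.getLogger("media-proxy-sketch")

# Checked in this order; the real precedence is not documented here (assumption).
PROXY_VARS = ("HTTPS_PROXY", "HTTP_PROXY", "ALL_PROXY",
              "https_proxy", "http_proxy", "all_proxy")


def pick_proxy(env: Mapping[str, str]) -> Optional[str]:
    """Return a usable proxy URL, or None for direct egress."""
    for name in PROXY_VARS:
        value = env.get(name)
        if not value:
            continue
        try:
            parts = urlsplit(value)
            valid = bool(parts.scheme and parts.hostname)
        except ValueError:
            valid = False
        if not valid:
            log.warning("ignoring malformed proxy value in %s", name)
            return None  # fall back to direct fetch
        return value
    return None  # no proxy variables set: direct egress
```

Passing `os.environ` as `env` applies the check to the current process environment.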
Capabilities (optional)
If you set capabilities, the entry only runs for those media types. For shared lists, OmeniaClaw can infer defaults:
- openai, anthropic, minimax: image
- minimax-portal: image
- moonshot: image + video
- openrouter: image + audio
- google (Gemini API): image + audio + video
- qwen: image + video
- mistral: audio
- zai: image
- groq: audio
- xai: audio
- deepgram: audio
- Any models.providers.<id>.models[] catalog with an image-capable model: image
For CLI entries, set capabilities explicitly to avoid surprising matches. If you omit capabilities, the entry is eligible for the list it appears in.
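The inferred defaults listed above amount to a lookup table. The Python sketch below models only provider entries in the shared list: explicit capabilities win, otherwise the provider's inferred set applies. The function names are hypothetical, and CLI entries and image-capable config catalogs are deliberately left out.

```python
from typing import Any, Dict, Mapping, Set

# Inferred defaults copied from the list above.
INFERRED_CAPABILITIES: Dict[str, Set[str]] = {
    "openai": {"image"},
    "anthropic": {"image"},
    "minimax": {"image"},
    "minimax-portal": {"image"},
    "moonshot": {"image", "video"},
    "openrouter": {"image", "audio"},
    "google": {"image", "audio", "video"},
    "qwen": {"image", "video"},
    "mistral": {"audio"},
    "zai": {"image"},
    "groq": {"audio"},
    "xai": {"audio"},
    "deepgram": {"audio"},
}


def entry_capabilities(entry: Mapping[str, Any]) -> Set[str]:
    """Explicit capabilities win; otherwise use the inferred provider defaults."""
    explicit = entry.get("capabilities")
    if explicit:
        return set(explicit)
    return set(INFERRED_CAPABILITIES.get(entry.get("provider", ""), set()))


def eligible(entry: Mapping[str, Any], capability: str) -> bool:
    """True when a shared-list provider entry may run for this media type."""
    return capability in entry_capabilities(entry)
```

Under this model, a shared `google` entry with no capabilities field is eligible for video, while a shared `openai` entry without explicit capabilities is not.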
Provider support matrix (OmeniaClaw integrations)
| Capability | Provider integration | Notes |
|---|---|---|
| Image | OpenAI, OpenAI Codex OAuth, Codex app-server, OpenRouter, Anthropic, Google, MiniMax, Moonshot, Qwen, Z.AI, config providers | Vendor plugins register image support; openai/* can use API-key or Codex OAuth routing; codex/* uses a bounded Codex app-server turn; MiniMax and MiniMax OAuth both use MiniMax-VL-01; image-capable config providers auto-register. |
| Audio | OpenAI, Groq, xAI, Deepgram, OpenRouter, Google, SenseAudio, ElevenLabs, Mistral | Provider transcription (Whisper/Groq/xAI/Deepgram/OpenRouter STT/Gemini/SenseAudio/Scribe/Voxtral). |
| Video | Google, Qwen, Moonshot | Provider video understanding via vendor plugins; Qwen video understanding uses the Standard DashScope endpoints. |
Model selection guidance
- Prefer the strongest latest-generation model available for each media capability when quality and safety matter.
- For tool-enabled agents handling untrusted inputs, avoid older/weaker media models.
- Keep at least one fallback per capability for availability (quality model + faster/cheaper model).
- CLI fallbacks (whisper-cli, whisper, gemini) are useful when provider APIs are unavailable.
- parakeet-mlx note: with --output-dir, OmeniaClaw reads <output-dir>/<media-basename>.txt when output format is txt (or unspecified); non-txt formats fall back to stdout.
Attachment policy
Per-capability attachments controls which attachments are processed:
- mode ("first" | "all", default: first): Whether to process the first selected attachment or all of them.
- maxAttachments (number, default: 1): Cap the number processed.
- prefer ("first" | "last" | "path" | "url"): Selection preference among candidate attachments.
When mode: "all", outputs are labeled [Image 1/2], [Audio 2/2], etc.
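One way to read the attachment policy is sketched below in Python. The exact semantics of prefer are not spelled out above, so this example assumes that "last" reverses the candidate order and that "path" or "url" moves attachments carrying that field to the front; the function names and the dict-based attachment shape are also hypothetical.

```python
from typing import Dict, List, Sequence

Attachment = Dict[str, str]  # hypothetical shape: {"path": ...} or {"url": ...}


def select_attachments(candidates: Sequence[Attachment],
                       mode: str = "first",
                       max_attachments: int = 1,
                       prefer: str = "first") -> List[Attachment]:
    """Order candidates by preference, then apply the mode and the cap."""
    ordered = list(candidates)
    if prefer == "last":
        ordered.reverse()
    elif prefer in ("path", "url"):
        # Stable sort: attachments that have the preferred field come first.
        ordered.sort(key=lambda a: prefer not in a)
    limit = 1 if mode == "first" else max_attachments
    return ordered[:limit]


def numbered_labels(kind: str, count: int) -> List[str]:
    """Labels used with mode "all", e.g. [Image 1/2], [Image 2/2]."""
    return [f"[{kind} {i}/{count}]" for i in range(1, count + 1)]


atts = [
    {"url": "https://example.com/a.png"},
    {"path": "/tmp/b.png"},
    {"path": "/tmp/c.png"},
]
```

Note that with the default maxAttachments of 1, setting mode to "all" alone still yields a single attachment; the cap has to be raised as well.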
File-attachment extraction behavior
- Extracted file text is wrapped as untrusted external content before it is appended to the media prompt.
- The injected block uses explicit boundary markers like <<<EXTERNAL_UNTRUSTED_CONTENT id="...">>> / <<<END_EXTERNAL_UNTRUSTED_CONTENT id="...">>> and includes a Source: External metadata line.
- This attachment-extraction path intentionally omits the long SECURITY NOTICE: banner to avoid bloating the media prompt; the boundary markers and metadata still remain.
- If a file has no extractable text, OmeniaClaw injects [No extractable text].
- If a PDF falls back to rendered page images in this path, the media prompt keeps the placeholder [PDF content rendered to images; images not forwarded to model] because this attachment-extraction step forwards text blocks, not the rendered PDF images.
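A minimal Python sketch of the wrapping described above follows. The marker strings and the Source: External line come from this section; the id format, the line ordering inside the block, and placing the [No extractable text] placeholder inside the wrapper are assumptions, and `wrap_external` is a hypothetical name.

```python
import uuid
from typing import Optional


def wrap_external(text: str, content_id: Optional[str] = None) -> str:
    """Wrap extracted file text in untrusted-content boundary markers."""
    content_id = content_id or uuid.uuid4().hex  # id format is an assumption
    body = text if text.strip() else "[No extractable text]"
    return "\n".join([
        f'<<<EXTERNAL_UNTRUSTED_CONTENT id="{content_id}">>>',
        "Source: External",
        body,
        f'<<<END_EXTERNAL_UNTRUSTED_CONTENT id="{content_id}">>>',
    ])
```

Using the same id on the opening and closing markers lets a reader of the prompt pair them up even when the extracted text itself contains marker-like strings.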
Config examples
Shared models + overrides
{
  tools: {
    media: {
      models: [
        { provider: "openai", model: "gpt-5.5", capabilities: ["image"] },
        {
          provider: "google",
          model: "gemini-3-flash-preview",
          capabilities: ["image", "audio", "video"],
        },
        {
          type: "cli",
          command: "gemini",
          args: [
            "-m",
            "gemini-3-flash",
            "--allowed-tools",
            "read_file",
            "Read the media at {{MediaPath}} and describe it in <= {{MaxChars}} characters.",
          ],
          capabilities: ["image", "video"],
        },
      ],
      audio: {
        attachments: { mode: "all", maxAttachments: 2 },
      },
      video: {
        maxChars: 500,
      },
    },
  },
}
Audio + video only
{
  tools: {
    media: {
      audio: {
        enabled: true,
        models: [
          { provider: "openai", model: "gpt-4o-mini-transcribe" },
          {
            type: "cli",
            command: "whisper",
            args: ["--model", "base", "{{MediaPath}}"],
          },
        ],
      },
      video: {
        enabled: true,
        maxChars: 500,
        models: [
          { provider: "google", model: "gemini-3-flash-preview" },
          {
            type: "cli",
            command: "gemini",
            args: [
              "-m",
              "gemini-3-flash",
              "--allowed-tools",
              "read_file",
              "Read the media at {{MediaPath}} and describe it in <= {{MaxChars}} characters.",
            ],
          },
        ],
      },
    },
  },
}
Image-only
{
  tools: {
    media: {
      image: {
        enabled: true,
        maxBytes: 10485760,
        maxChars: 500,
        models: [
          { provider: "openai", model: "gpt-5.5" },
          { provider: "anthropic", model: "claude-opus-4-6" },
          {
            type: "cli",
            command: "gemini",
            args: [
              "-m",
              "gemini-3-flash",
              "--allowed-tools",
              "read_file",
              "Read the media at {{MediaPath}} and describe it in <= {{MaxChars}} characters.",
            ],
          },
        ],
      },
    },
  },
}
Multi-modal single entry
{
  tools: {
    media: {
      image: {
        models: [
          {
            provider: "google",
            model: "gemini-3.1-pro-preview",
            capabilities: ["image", "video", "audio"],
          },
        ],
      },
      audio: {
        models: [
          {
            provider: "google",
            model: "gemini-3.1-pro-preview",
            capabilities: ["image", "video", "audio"],
          },
        ],
      },
      video: {
        models: [
          {
            provider: "google",
            model: "gemini-3.1-pro-preview",
            capabilities: ["image", "video", "audio"],
          },
        ],
      },
    },
  },
}
Status output
When media understanding runs, /status includes a short summary line:
📎 Media: image ok (openai/gpt-5.4) · audio skipped (maxBytes)
This shows per-capability outcomes and the chosen provider/model when applicable.
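For readers who parse or reproduce this line in tooling, the sample above suggests a simple shape: a prefix, then one "capability outcome (detail)" segment per capability joined by " · ". The Python sketch below rebuilds the sample from that inferred shape; the tuple layout and `format_status` are hypothetical, and the real formatter may emit other outcomes or details.

```python
from typing import List, Optional, Tuple

# (capability, outcome, detail): detail is a provider/model ref or a skip reason.
Outcome = Tuple[str, str, Optional[str]]


def format_status(outcomes: List[Outcome]) -> str:
    """Join per-capability outcomes into one summary line."""
    parts = []
    for capability, outcome, detail in outcomes:
        segment = f"{capability} {outcome}"
        if detail:
            segment += f" ({detail})"
        parts.append(segment)
    return "📎 Media: " + " · ".join(parts)


line = format_status([
    ("image", "ok", "openai/gpt-5.4"),
    ("audio", "skipped", "maxBytes"),
])
```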
Notes
- Understanding is best-effort. Errors do not block replies.
- Attachments are still passed to models even when understanding is disabled.
- Use scope to limit where understanding runs (e.g. only DMs).