Multimodal Content Handling
PRX supports multimodal content -- images, audio, and video -- across its channels and LLM providers. The multimodal subsystem handles content type detection, format transcoding, size enforcement, and capability negotiation between channels and providers.
Overview
When a user sends a media attachment (photo, voice message, document) through a channel, the multimodal pipeline:
- Detects the content type using magic bytes and file extension
- Validates the content against size and format constraints
- Transcodes the content if the target provider does not support the source format
- Dispatches the content to the LLM provider as part of the conversation context
- Handles media in the response if the provider generates images or audio
 Channel Input        Provider Output
       │                     │
       ▼                     ▼
┌──────────────┐      ┌──────────────┐
│ Content Type │      │   Response   │
│  Detection   │      │    Media     │
└──────┬───────┘      └──────┬───────┘
       │                     │
       ▼                     ▼
┌──────────────┐      ┌──────────────┐
│  Validation  │      │ Transcoding  │
│   & Limits   │      │ (if needed)  │
└──────┬───────┘      └──────┬───────┘
       │                     │
       ▼                     ▼
┌──────────────┐      ┌──────────────┐
│ Transcoding  │      │   Channel    │
│ (if needed)  │      │   Delivery   │
└──────┬───────┘      └──────────────┘
       │
       ▼
┌──────────────┐
│   Provider   │
│   Dispatch   │
└──────────────┘
Supported Content Types
Images
| Format | Detection | Send to Provider | Receive from Provider |
|---|---|---|---|
| JPEG | Magic bytes FF D8 FF | Yes | Yes |
| PNG | Magic bytes 89 50 4E 47 | Yes | Yes |
| GIF | Magic bytes 47 49 46 | Yes (first frame) | No |
| WebP | RIFF header + WEBP | Yes | Yes |
| BMP | Magic bytes 42 4D | Transcoded to PNG | No |
| TIFF | Magic bytes 49 49 or 4D 4D | Transcoded to PNG | No |
| SVG | XML detection | Rasterized to PNG | No |
Audio
| Format | Detection | Transcription | Provider Input |
|---|---|---|---|
| OGG/Opus | OGG header | Yes (via STT) | Transcribed text |
| MP3 | ID3/sync header | Yes (via STT) | Transcribed text |
| WAV | RIFF + WAVE | Yes (via STT) | Transcribed text |
| M4A/AAC | ftyp box | Yes (via STT) | Transcribed text |
| WebM | EBML header | Yes (via STT) | Transcribed text |
Video
| Format | Detection | Processing |
|---|---|---|
| MP4 | ftyp box | Extract keyframes + audio track |
| WebM | EBML header | Extract keyframes + audio track |
| MOV | ftyp box | Extract keyframes + audio track |
Video files are decomposed into keyframe images and an audio track. The keyframes are sent as images and the audio is transcribed.
Content Type Detection
Detection draws on up to three signals:
- Magic bytes -- the first 16 bytes of the file are checked against known signatures
- File extension -- if magic bytes are inconclusive, the file extension is used as a fallback
- MIME type header -- for content received via HTTP, the Content-Type header is consulted
The detection result determines which processing pipeline handles the content.
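The magic-byte check and the extension fallback described above can be sketched roughly as follows. This is a minimal illustration using only the Rust standard library, not PRX's actual detector: the names `MediaKind`, `sniff_magic`, `from_extension`, and `detect` are hypothetical, only a handful of the formats from the tables are covered, and the HTTP Content-Type signal is left out.

```rust
use std::path::Path;

/// Coarse media kinds used to pick a processing pipeline (hypothetical names).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MediaKind {
    Jpeg,
    Png,
    Gif,
    WebP,
    Wav,
    Ogg,
    Unknown,
}

/// Match the leading bytes against a few of the signatures listed in the tables.
pub fn sniff_magic(bytes: &[u8]) -> MediaKind {
    // Only the first 16 bytes are inspected, as the text describes.
    let head = &bytes[..bytes.len().min(16)];
    if head.starts_with(&[0xFF, 0xD8, 0xFF]) {
        MediaKind::Jpeg
    } else if head.starts_with(&[0x89, 0x50, 0x4E, 0x47]) {
        MediaKind::Png
    } else if head.starts_with(b"GIF") {
        MediaKind::Gif
    } else if head.len() >= 12 && head.starts_with(b"RIFF") {
        // RIFF containers carry their form type at byte offset 8.
        match &head[8..12] {
            b"WEBP" => MediaKind::WebP,
            b"WAVE" => MediaKind::Wav,
            _ => MediaKind::Unknown,
        }
    } else if head.starts_with(b"OggS") {
        MediaKind::Ogg
    } else {
        MediaKind::Unknown
    }
}

/// Fallback: map a file extension (case-insensitive) to a media kind.
pub fn from_extension(file_name: &str) -> MediaKind {
    let ext = Path::new(file_name)
        .extension()
        .and_then(|e| e.to_str())
        .map(|e| e.to_ascii_lowercase());
    match ext.as_deref() {
        Some("jpg") | Some("jpeg") => MediaKind::Jpeg,
        Some("png") => MediaKind::Png,
        Some("gif") => MediaKind::Gif,
        Some("webp") => MediaKind::WebP,
        Some("wav") => MediaKind::Wav,
        Some("ogg") | Some("opus") => MediaKind::Ogg,
        _ => MediaKind::Unknown,
    }
}

/// Magic bytes first; the extension is consulted only when they are inconclusive.
pub fn detect(bytes: &[u8], file_name: Option<&str>) -> MediaKind {
    match sniff_magic(bytes) {
        MediaKind::Unknown => file_name.map(from_extension).unwrap_or(MediaKind::Unknown),
        kind => kind,
    }
}
```

Checking magic bytes before the extension means a mislabeled upload (for example, a PNG saved as `photo.jpg`) is still routed by what the bytes actually contain.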
Configuration
[multimodal]
enabled = true
[multimodal.images]
max_size_bytes = 20_971_520 # 20 MB
max_resolution = "4096x4096" # maximum width x height
auto_resize = true # resize images exceeding max_resolution
resize_quality = 85 # JPEG quality for resized images (1-100)
strip_exif = true # remove EXIF metadata for privacy
[multimodal.audio]
max_size_bytes = 26_214_400 # 25 MB
max_duration_secs = 300 # 5 minutes
stt_provider = "whisper" # "whisper", "deepgram", or "provider" (use LLM provider's STT)
stt_model = "whisper-1"
stt_language = "auto" # "auto" for language detection, or ISO 639-1 code
[multimodal.video]
max_size_bytes = 104_857_600 # 100 MB
max_duration_secs = 120 # 2 minutes
keyframe_interval_secs = 5 # extract one keyframe every 5 seconds
max_keyframes = 20 # maximum keyframes to extract
extract_audio = true # transcribe audio track
Configuration Reference
Images
| Field | Type | Default | Description |
|---|---|---|---|
| max_size_bytes | u64 | 20971520 | Maximum image file size (20 MB) |
| max_resolution | String | "4096x4096" | Maximum image dimensions (WxH) |
| auto_resize | bool | true | Automatically resize oversized images |
| resize_quality | u8 | 85 | JPEG quality for resized images (1--100) |
| strip_exif | bool | true | Remove EXIF metadata from images |
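To make the interaction between max_resolution and auto_resize concrete, here is a small sketch of how target dimensions could be computed. It assumes the aspect ratio is preserved when downscaling, which the reference above does not state; `parse_resolution` and `fit_within` are hypothetical helper names, not PRX functions.

```rust
/// Parse a "WIDTHxHEIGHT" string such as "4096x4096".
pub fn parse_resolution(s: &str) -> Option<(u32, u32)> {
    let (w, h) = s.split_once('x')?;
    Some((w.trim().parse().ok()?, h.trim().parse().ok()?))
}

/// Scale (w, h) down so it fits inside (max_w, max_h).
/// Assumption: aspect ratio is preserved. Returns None when the image already fits.
pub fn fit_within(w: u32, h: u32, max_w: u32, max_h: u32) -> Option<(u32, u32)> {
    if w <= max_w && h <= max_h {
        return None;
    }
    // Integer arithmetic in u64 avoids floating-point rounding surprises.
    let (w64, h64) = (w as u64, h as u64);
    let (mw, mh) = (max_w as u64, max_h as u64);
    let (new_w, new_h) = if w64 * mh >= h64 * mw {
        // Width is the binding constraint.
        (mw, h64 * mw / w64)
    } else {
        // Height is the binding constraint.
        (w64 * mh / h64, mh)
    };
    // Never return a zero-sized dimension.
    Some((new_w.max(1) as u32, new_h.max(1) as u32))
}
```

With the default limit, an 8192x4096 image would be resized to 4096x2048 under these assumptions, while a 1000x500 image would be left alone.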
Audio
| Field | Type | Default | Description |
|---|---|---|---|
| max_size_bytes | u64 | 26214400 | Maximum audio file size (25 MB) |
| max_duration_secs | u64 | 300 | Maximum audio duration (5 minutes) |
| stt_provider | String | "whisper" | Speech-to-text provider |
| stt_model | String | "whisper-1" | STT model name |
| stt_language | String | "auto" | Language hint for transcription |
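As an illustration of how the size and duration limits might be enforced before audio reaches the STT provider, consider the sketch below. The struct mirrors the two limit fields and their documented defaults; the `LimitError` type, the `check_audio` function, and the choice to treat the limits as inclusive are assumptions made for the example.

```rust
/// Limits mirroring the documented [multimodal.audio] defaults.
#[derive(Debug, Clone, Copy)]
pub struct AudioLimits {
    pub max_size_bytes: u64,
    pub max_duration_secs: u64,
}

impl Default for AudioLimits {
    fn default() -> Self {
        Self {
            max_size_bytes: 26_214_400, // 25 MB
            max_duration_secs: 300,     // 5 minutes
        }
    }
}

/// Hypothetical error type; PRX's real error names are not documented here.
#[derive(Debug, PartialEq, Eq)]
pub enum LimitError {
    TooLarge { size: u64, max: u64 },
    TooLong { secs: u64, max: u64 },
}

/// Reject audio that exceeds either limit.
/// Assumption: limits are inclusive, so a file of exactly max_size_bytes passes.
pub fn check_audio(
    size_bytes: u64,
    duration_secs: u64,
    limits: AudioLimits,
) -> Result<(), LimitError> {
    if size_bytes > limits.max_size_bytes {
        return Err(LimitError::TooLarge {
            size: size_bytes,
            max: limits.max_size_bytes,
        });
    }
    if duration_secs > limits.max_duration_secs {
        return Err(LimitError::TooLong {
            secs: duration_secs,
            max: limits.max_duration_secs,
        });
    }
    Ok(())
}
```

Checking size before duration is a deliberate ordering in this sketch: the byte count is known without decoding, whereas duration usually requires reading the container.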
Video
| Field | Type | Default | Description |
|---|---|---|---|
| max_size_bytes | u64 | 104857600 | Maximum video file size (100 MB) |
| max_duration_secs | u64 | 120 | Maximum video duration (2 minutes) |
| keyframe_interval_secs | u64 | 5 | Seconds between extracted keyframes |
| max_keyframes | usize | 20 | Maximum number of keyframes to extract |
| extract_audio | bool | true | Transcribe the video's audio track |
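The relationship between keyframe_interval_secs and max_keyframes is easiest to see with numbers. The sketch below computes candidate keyframe timestamps under two assumptions that the reference does not confirm: sampling starts at 0 seconds, and the cap truncates the tail of the video rather than spreading frames evenly. `keyframe_timestamps` is a hypothetical name.

```rust
/// Timestamps (in seconds) at which keyframes would be sampled.
/// Assumptions: sampling starts at t = 0 and stops once `max_keyframes` is reached.
pub fn keyframe_timestamps(
    duration_secs: u64,
    interval_secs: u64,
    max_keyframes: usize,
) -> Vec<u64> {
    // A zero interval would never advance; treat it as "no keyframes".
    if interval_secs == 0 {
        return Vec::new();
    }
    (0..duration_secs)
        .step_by(interval_secs as usize)
        .take(max_keyframes)
        .collect()
}
```

With the defaults, a video at the 120-second limit has 24 candidate positions at a 5-second interval, so under these assumptions the 20-keyframe cap would stop sampling at the 95-second mark. Raising keyframe_interval_secs to 6 would cover the whole clip within the same cap.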
Provider Capabilities
Not all LLM providers support the same media types. PRX negotiates capabilities automatically:
| Provider | Image Input | Image Output | Audio Input | Native Multimodal |
|---|---|---|---|---|
| Anthropic (Claude) | Yes | No | No (transcribe first) | Yes (vision) |
| OpenAI (GPT-4o) | Yes | Yes (DALL-E) | Yes (Whisper) | Yes |
| Google (Gemini) | Yes | Yes (Imagen) | Yes | Yes |
| Ollama (LLaVA) | Yes | No | No | Yes (vision) |
| AWS Bedrock | Varies by model | Varies | No | Varies |
When a provider does not support a media type natively, PRX applies fallback processing:
- Image not supported -- image is described using a vision-capable model, and the description is sent as text
- Audio not supported -- audio is transcribed using the configured STT provider, and the transcript is sent as text
- Video not supported -- keyframes and audio transcript are sent as a composite message
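The fallback rules above amount to a small decision table. The following sketch expresses it in code; all type and function names (`Media`, `ProviderCaps`, `Plan`, `plan_for`) are hypothetical, and the `video_input` flag is an assumption added for symmetry, since the capability table has no video column.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Media {
    Image,
    Audio,
    Video,
}

/// What a provider accepts natively (hypothetical shape).
#[derive(Debug, Clone, Copy)]
pub struct ProviderCaps {
    pub image_input: bool,
    pub audio_input: bool,
    pub video_input: bool, // assumption: not a column in the table above
}

/// How the content would be delivered to the provider.
#[derive(Debug, PartialEq, Eq)]
pub enum Plan {
    SendNative,
    DescribeWithVisionModel,
    TranscribeWithStt,
    KeyframesPlusTranscript,
}

/// Pick native delivery when supported, otherwise the documented fallback.
pub fn plan_for(media: Media, caps: ProviderCaps) -> Plan {
    match media {
        Media::Image if caps.image_input => Plan::SendNative,
        Media::Image => Plan::DescribeWithVisionModel,
        Media::Audio if caps.audio_input => Plan::SendNative,
        Media::Audio => Plan::TranscribeWithStt,
        Media::Video if caps.video_input => Plan::SendNative,
        Media::Video => Plan::KeyframesPlusTranscript,
    }
}
```

For a provider that accepts images but not audio, such as the Anthropic row in the table, a voice message would map to `Plan::TranscribeWithStt` and a photo to `Plan::SendNative`.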
Channel Media Limits
Each channel imposes its own file size and format restrictions:
| Channel | Max Upload | Max Download | Supported Formats |
|---|---|---|---|
| Telegram | 50 MB | 20 MB | Images, audio, video, documents |
| Discord | 25 MB (free) | 25 MB | Images, audio, video, documents |
| | 16 MB (media) | 16 MB | JPEG, PNG, MP3, MP4, PDF |
| | 20 MB | 20 MB | Images, audio, documents |
| DingTalk | 20 MB | 20 MB | Images, audio, documents |
| Lark | 25 MB | 25 MB | Images, audio, video, documents |
| Matrix | Homeserver dependent | Homeserver dependent | All common formats |
| | 25 MB (typical) | 25 MB | All via MIME attachments |
| CLI | Filesystem limit | Filesystem limit | All formats |
PRX enforces the channel's limits before attempting to send a response. If a generated image or file exceeds the channel limit, it is compressed or a download link is provided instead.
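The enforcement step could look something like the sketch below. The text does not say how PRX chooses between compressing and linking, so this example assumes one plausible rule: compress only when an estimated compressed size fits the channel limit, otherwise fall back to a download link. `Delivery`, `choose_delivery`, and `MIB` are hypothetical names.

```rust
/// One mebibyte; the size comments in the configuration use this unit (20 MB = 20_971_520).
pub const MIB: u64 = 1024 * 1024;

#[derive(Debug, PartialEq, Eq)]
pub enum Delivery {
    SendAsIs,
    SendCompressed,
    DownloadLink,
}

/// Decide how to deliver a generated file to a channel.
/// `compressed_estimate` is None when the file cannot be compressed further.
pub fn choose_delivery(
    size_bytes: u64,
    channel_max_bytes: u64,
    compressed_estimate: Option<u64>,
) -> Delivery {
    if size_bytes <= channel_max_bytes {
        return Delivery::SendAsIs;
    }
    match compressed_estimate {
        // Assumption: compression is attempted only if it is expected to fit.
        Some(estimate) if estimate <= channel_max_bytes => Delivery::SendCompressed,
        _ => Delivery::DownloadLink,
    }
}
```

For a channel with a 25 MB upload limit, a 30 MB image expected to compress to 18 MB would be sent compressed, while a 30 MB file with no useful compression would be replaced by a link.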
Transcoding Pipeline
When format conversion is needed, PRX uses the following transcoding pipeline:
- Image transcoding -- handled by the image crate (pure Rust, no external dependencies)
- Audio transcoding -- handled by FFmpeg if installed, otherwise falls back to native decoders for common formats
- Video keyframe extraction -- requires FFmpeg
FFmpeg Detection
PRX automatically detects FFmpeg at startup. To check what was detected, run:
prx doctor multimodal
Output:
Multimodal Support:
Images: OK (native)
Audio transcoding: OK (ffmpeg 6.1 detected)
Video processing: OK (ffmpeg 6.1 detected)
STT provider: OK (whisper-1 via OpenAI)
If FFmpeg is not installed, audio transcoding is limited to natively supported formats and video processing is unavailable.
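How PRX probes for FFmpeg internally is not documented here, but a common approach is to run `ffmpeg -version` and read the first line of its output. The sketch below follows that approach using only the standard library; `parse_ffmpeg_version` and `detect_ffmpeg` are hypothetical names, and the parser assumes the usual `ffmpeg version <token> ...` banner.

```rust
use std::process::Command;

/// Extract the version token from the first line of `ffmpeg -version` output,
/// e.g. "ffmpeg version 6.1 Copyright (c) ..." yields "6.1".
pub fn parse_ffmpeg_version(output: &str) -> Option<&str> {
    let first_line = output.lines().next()?;
    let mut words = first_line.split_whitespace();
    match (words.next()?, words.next()?) {
        ("ffmpeg", "version") => words.next(),
        _ => None,
    }
}

/// Returns the detected version, or None if `ffmpeg` is missing or fails to run.
pub fn detect_ffmpeg() -> Option<String> {
    // Spawning fails with an error when the binary is not on PATH.
    let out = Command::new("ffmpeg").arg("-version").output().ok()?;
    if !out.status.success() {
        return None;
    }
    let text = String::from_utf8_lossy(&out.stdout);
    parse_ffmpeg_version(&text).map(str::to_owned)
}
```

Keeping the parser separate from the process call makes it testable without FFmpeg installed. Note that distribution builds often report longer tokens than a bare `6.1`, so callers should treat the result as an opaque string rather than a strict semantic version.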
Limitations
- Video processing requires FFmpeg to be installed on the system
- Large media files may significantly increase LLM token usage (especially multiple keyframes)
- Some providers charge additional fees for vision/multimodal API calls
- Real-time audio streaming (live voice conversation) is not yet supported
- Generated images from providers (DALL-E, Imagen) are subject to the provider's content policy
- SVG rasterization uses a basic renderer; complex SVGs may not render accurately
Related Pages
- Agent Runtime -- how media content flows through the agent loop
- Channels Overview -- channel-specific media handling
- Providers Overview -- provider multimodal capabilities
- Embeddings Backend -- embedding models for memory