Media & Multimodal Memory
ExoVault supports multimodal memory — upload video, audio, images, and PDFs alongside text memories. Content is automatically extracted, encrypted, and made searchable.
Supported Formats
| Type | Formats | Extraction |
|---|---|---|
| Video | MP4, WebM | Full audio transcription + visual descriptions |
| Audio | MP3, WAV, OGG | Full speech transcription |
| Images | PNG, JPG, WebP | Visual description + text recognition (OCR) |
| PDF | PDF | Full text extraction |
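As a minimal sketch of how a client might classify a file before upload, the table above can be expressed as an extension lookup. The modality labels (`"video"`, `"audio"`, `"image"`, `"pdf"`) and the function name `detect_modality` are illustrative assumptions, not part of the ExoVault API.

```python
from pathlib import Path
from typing import Optional

# Extension-to-modality lookup mirroring the Supported Formats table.
# The modality labels are assumptions chosen for illustration.
MODALITY_BY_EXTENSION = {
    ".mp4": "video",
    ".webm": "video",
    ".mp3": "audio",
    ".wav": "audio",
    ".ogg": "audio",
    ".png": "image",
    ".jpg": "image",
    ".webp": "image",
    ".pdf": "pdf",
}


def detect_modality(file_name: str) -> Optional[str]:
    """Return the modality for a file name, or None if the format is not listed."""
    return MODALITY_BY_EXTENSION.get(Path(file_name).suffix.lower())
```

A client could call `detect_modality("product-review-Q1.mp4")` and reject the upload early when the result is `None`.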
How It Works
1. Upload
Upload media files through the dashboard (Media page) or via the API. Files are encrypted with AES-256-GCM before storage in Supabase Storage.
2. Extraction
After upload, an Inngest background job sends the encrypted file to Gemini 2.5 Flash for content extraction:
- Video: Gemini transcribes all spoken words verbatim with speaker labels, plus describes visual content with timestamps
- Audio: Full speech-to-text transcription
- Images: Describes what's in the image, reads any text/labels
- PDF: Extracts all text content
The extracted text is encrypted and stored alongside the original file.
3. Embedding
The extracted text is embedded using gemini-embedding-2-preview (3,072 dimensions) — the same multimodal model used for all ExoVault embeddings. This places media content in the same vector space as text memories, enabling cross-modal search.
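To illustrate why a shared vector space enables cross-modal search, here is a minimal sketch that ranks text and media items against one query vector. It assumes cosine similarity as the comparison metric, which this page does not confirm, and it uses tiny 3-dimensional vectors as stand-ins for the 3,072-dimensional embeddings. The function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction; treat as no similarity
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    query: Sequence[float],
    items: Sequence[Tuple[str, Sequence[float]]],
) -> List[Tuple[str, float]]:
    """Score every (label, vector) item against the query, best match first."""
    scored = [(label, cosine_similarity(query, vector)) for label, vector in items]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Because text memories and media-derived text are embedded by the same model, both kinds of item can sit in the same `items` list and be ranked together. Vector similarity may be only one of several signals in the actual search pipeline.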
4. Search
Agents search media content using the same search_memories tool they use for everything else:
> search_memories("rate limit decision")
✓ Found: "The API rate limit should be 100 requests per second"
from product-review-Q1.mp4 · similarity 0.94
No special media-specific search tools are needed. If it was said in a video, ExoVault finds it.
Media Attachments
Media files are attached to memories. When a memory has attachments:
- The attachments field lists all associated media files
- Each attachment includes: modality, mimeType, fileName, fileSizeBytes, embeddingStatus, extractionStatus, extractedText
- Extracted text is included in search results so agents see what was said/shown
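The attachment fields listed above could be modeled on the client side roughly as follows. This is a sketch only: the field names come from this page, but the Python types, the optional `extractedText`, and the `from_dict` helper are assumptions.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class Attachment:
    # Field names follow the documentation; the types are assumed.
    modality: str
    mimeType: str
    fileName: str
    fileSizeBytes: int
    embeddingStatus: str
    extractionStatus: str
    extractedText: Optional[str] = None  # assumed absent until extraction finishes

    @classmethod
    def from_dict(cls, data: dict) -> "Attachment":
        """Build an Attachment from a dict, ignoring keys this sketch does not know."""
        known = {f.name for f in fields(cls)}
        return cls(**{key: value for key, value in data.items() if key in known})
```

Ignoring unknown keys keeps the client tolerant of fields the service might add later, while a missing required field still raises a `TypeError`.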
Embedding Status
| Status | Meaning |
|---|---|
| pending | Upload complete, waiting for embedding |
| processing | Gemini is extracting and embedding |
| ready | Extraction and embedding complete; searchable |
| failed | Extraction failed (check file format/size) |
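A client that polls for completion might model these four statuses as an enum. The sketch below assumes the status arrives as one of the lowercase strings in the table; the helper names `is_searchable` and `is_terminal` are hypothetical.

```python
from enum import Enum


class EmbeddingStatus(str, Enum):
    PENDING = "pending"
    PROCESSING = "processing"
    READY = "ready"
    FAILED = "failed"


# Statuses after which no further change is expected (assumption:
# the table describes "ready" and "failed" as end states).
TERMINAL_STATUSES = {EmbeddingStatus.READY, EmbeddingStatus.FAILED}


def is_searchable(status: str) -> bool:
    """True only when extraction and embedding have completed."""
    return EmbeddingStatus(status) is EmbeddingStatus.READY


def is_terminal(status: str) -> bool:
    """True when polling can stop, whether the outcome was success or failure."""
    return EmbeddingStatus(status) in TERMINAL_STATUSES
```

An unrecognized status string raises `ValueError`, which surfaces unexpected values rather than silently treating them as pending.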
File Size Limits
| Plan | Max File Size |
|---|---|
| Starter | 20 MB |
| Pro | 100 MB |
| Enterprise | Custom |
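A pre-upload size check based on the table might look like the sketch below. It assumes decimal megabytes (1 MB = 1,000,000 bytes), which this page does not specify, and it treats the Enterprise "Custom" limit as unknown, so that plan is left unchecked. The function name and plan keys are hypothetical.

```python
from typing import Dict, Optional

MB = 1_000_000  # assumption: decimal megabytes; the service may use 1024 * 1024

# Per-plan limits from the table; None means the limit is custom and unknown here.
MAX_FILE_SIZE_BYTES: Dict[str, Optional[int]] = {
    "starter": 20 * MB,
    "pro": 100 * MB,
    "enterprise": None,
}


def check_file_size(plan: str, file_size_bytes: int) -> bool:
    """Return True if the file fits the plan's limit (or the limit is unknown)."""
    key = plan.lower()
    if key not in MAX_FILE_SIZE_BYTES:
        raise ValueError(f"unknown plan: {plan}")
    limit = MAX_FILE_SIZE_BYTES[key]
    return limit is None or file_size_bytes <= limit
```

Checking locally avoids encrypting and uploading a file that would be rejected; the server-side limit remains authoritative.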
Cost note: Video embedding is more expensive than text (~$0.00079/frame). ExoVault caps video processing at 20 MB to manage costs. For longer videos, consider uploading the audio track separately.
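To get a feel for the per-frame figure in the cost note, the sketch below multiplies it by an estimated frame count. The sampling rate is an assumption (this page does not say how many frames per second are embedded), so the default of 1 frame per second is purely illustrative, as is the function name.

```python
import math

COST_PER_FRAME_USD = 0.00079  # approximate figure from the cost note


def estimate_video_embedding_cost(
    duration_seconds: float,
    frames_per_second: float = 1.0,  # assumed sampling rate, not documented
) -> float:
    """Rough embedding cost in USD for a video of the given duration."""
    if duration_seconds < 0 or frames_per_second <= 0:
        raise ValueError("duration must be >= 0 and sampling rate must be > 0")
    frame_count = math.ceil(duration_seconds * frames_per_second)
    return frame_count * COST_PER_FRAME_USD
```

Under that assumed rate, a 10-minute video is 600 frames, or roughly $0.47; at a higher sampling rate the estimate scales linearly.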
Privacy & Encryption
All media follows ExoVault's zero-knowledge model:
- Files are encrypted client-side before upload (dashboard) or server-side via wrapped MEK (agent uploads)
- Gemini processes files for extraction, then the extracted text is encrypted
- The server stores only ciphertext — neither the original media nor the extracted text is accessible in plaintext
- Embeddings are stored alongside encrypted content for search
Dashboard
The Media page in the dashboard shows all uploaded media across your vaults:
- Preview thumbnails for images
- Playback for audio/video
- Extraction status indicators
- Link to the associated memory
Agent Usage
Agents interact with media through standard MCP tools — no special media tools needed:
- search_memories: finds content extracted from media alongside text memories
- write_memory: memories can reference media attachments
- read_memories: returns attachment metadata including extracted text
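Since these are standard MCP tools, an agent invokes them with an MCP `tools/call` request over JSON-RPC 2.0. The sketch below builds such a request for `search_memories`. The envelope follows the MCP tool-call shape, but the argument name `query` is an assumption, because this page does not document the tool's input schema.

```python
import json
from typing import Any, Dict


def build_tool_call(request_id: int, tool_name: str, arguments: Dict[str, Any]) -> str:
    """Serialize an MCP tools/call request as a JSON-RPC 2.0 message."""
    request = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    return json.dumps(request)


# Hypothetical argument name "query"; check the tool's published schema.
message = build_tool_call(1, "search_memories", {"query": "rate limit decision"})
```

The same helper covers `write_memory` and `read_memories`; only the tool name and arguments change, so media needs no separate call path.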
Related Pages
- Embedding Model — Gemini multimodal embedding details
- Search Strategies — 4-signal hybrid search pipeline
- Encryption Model — How media encryption works
- Limits and Quotas — File size and storage limits