Media & Multimodal Memory

ExoVault supports multimodal memory — upload video, audio, images, and PDFs alongside text memories. Content is automatically extracted, encrypted, and made searchable.

Supported Formats

Type	Formats	Extraction
Video	MP4, WebM	Full audio transcription + visual descriptions
Audio	MP3, WAV, OGG	Full speech transcription
Images	PNG, JPG, WebP	Visual description + text recognition (OCR)
PDF	PDF	Full text extraction
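The table above can be turned into a simple client-side check before uploading. This is a minimal sketch, not ExoVault's own detection logic: the modality names, the `.jpeg` alias, and the `detect_modality` helper are assumptions made for illustration.

```python
from pathlib import Path
from typing import Optional

# Extension -> modality, following the Supported Formats table.
# The modality strings and the ".jpeg" alias are assumptions.
MODALITY_BY_EXTENSION = {
    ".mp4": "video", ".webm": "video",
    ".mp3": "audio", ".wav": "audio", ".ogg": "audio",
    ".png": "image", ".jpg": "image", ".jpeg": "image", ".webp": "image",
    ".pdf": "pdf",
}


def detect_modality(file_name: str) -> Optional[str]:
    """Return the modality for a file name, or None if the format is not listed."""
    return MODALITY_BY_EXTENSION.get(Path(file_name).suffix.lower())
```

For example, `detect_modality("product-review-Q1.mp4")` returns `"video"`, while a file such as `notes.docx` returns `None` and can be rejected before any upload is attempted.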

How It Works

1. Upload

Upload media files through the dashboard (Media page) or via the API. Files are encrypted with AES-256-GCM before storage in Supabase Storage.

2. Extraction

After upload, an Inngest background job sends the encrypted file to Gemini 2.5 Flash for content extraction:

Video: Transcribes all spoken words verbatim with speaker labels and describes visual content with timestamps
Audio: Produces a full speech-to-text transcription
Images: Describes what is in the image and reads any text or labels
PDF: Extracts all text content

The extracted text is encrypted and stored alongside the original file.

3. Embedding

The extracted text is embedded using gemini-embedding-2-preview (3,072 dimensions) — the same multimodal model used for all ExoVault embeddings. This places media content in the same vector space as text memories, enabling cross-modal search.
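To illustrate why a shared vector space enables cross-modal search, the sketch below ranks items by cosine similarity to a query vector. It assumes cosine similarity as the comparison metric, which this page does not state, and uses toy 3-dimensional vectors in place of real 3,072-dimensional embeddings. The function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    query: Sequence[float],
    items: List[Tuple[str, Sequence[float]]],
) -> List[Tuple[float, str]]:
    """Return (score, label) pairs, best match first."""
    scored = [(cosine_similarity(query, vec), label) for label, vec in items]
    return sorted(scored, reverse=True)
```

Because a video transcript and a text memory are embedded by the same model, both can appear in the `items` list and be ranked against one query without any modality-specific handling.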

4. Search

Agents search media content using the same search_memories tool they use for everything else:

> search_memories("rate limit decision")

✓ Found: "The API rate limit should be 100 requests per second"
  from product-review-Q1.mp4 · similarity 0.94

No media-specific search tools are needed. If it was said in a video, ExoVault finds it.

Media Attachments

Media files are attached to memories. When a memory has attachments:

The attachments field lists all associated media files
Each attachment includes: modality, mimeType, fileName, fileSizeBytes, embeddingStatus, extractionStatus, extractedText
Extracted text is included in search results so agents can see what was said or shown
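The attachment fields listed above could be modeled on the client as follows. This is a sketch only: the field names come from this page, but the field types, the JSON shape of a memory, and the `parse_attachments` helper are assumptions.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class Attachment:
    # Field names are taken from the list above; the types are assumed.
    modality: str
    mimeType: str
    fileName: str
    fileSizeBytes: int
    embeddingStatus: str
    extractionStatus: str
    extractedText: Optional[str] = None


_ATTACHMENT_FIELDS = {f.name for f in fields(Attachment)}


def parse_attachments(memory: dict) -> List[Attachment]:
    """Build Attachment objects from a memory payload (assumed shape).

    Unknown keys are ignored so the sketch tolerates extra fields.
    """
    return [
        Attachment(**{k: v for k, v in item.items() if k in _ATTACHMENT_FIELDS})
        for item in memory.get("attachments", [])
    ]
```

A memory without an `attachments` key yields an empty list, so callers can iterate over the result without checking for media first.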

Embedding Status

Status	Meaning
`pending`	Upload complete, waiting for embedding
`processing`	Gemini is extracting and embedding
`ready`	Extraction and embedding complete — searchable
`failed`	Extraction or embedding failed (check file format/size)
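A client that needs to know when an upload becomes searchable can poll the status until it reaches `ready` or `failed`. The sketch below assumes the caller supplies a `get_status` callable that returns one of the four status strings. How that status is fetched is not covered on this page, so the callable, the poll interval, and the timeout are all hypothetical.

```python
import time
from typing import Callable


def wait_until_searchable(
    get_status: Callable[[], str],
    poll_seconds: float = 2.0,
    timeout_seconds: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the status is terminal.

    Returns True for "ready" and False for "failed".
    Raises TimeoutError if neither is reached before the deadline.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status == "ready":
            return True
        if status == "failed":
            return False
        # "pending" and "processing" both mean: keep waiting.
        if time.monotonic() >= deadline:
            raise TimeoutError(f"status still {status!r} after {timeout_seconds}s")
        sleep(poll_seconds)
```

Passing `sleep` as a parameter keeps the loop easy to test, since a no-op can be substituted for the real delay.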

File Size Limits

Plan	Max File Size
Starter	20 MB
Pro	100 MB
Enterprise	Custom

Cost note: Video embedding is more expensive than text embedding (~$0.00079/frame). ExoVault caps video processing at 20 MB to manage costs. For videos that exceed this cap, consider uploading the audio track separately.
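The limits and the per-frame figure above can be checked before uploading. This sketch makes several assumptions that this page does not confirm: 1 MB is treated as 1,000,000 bytes, the 20 MB video processing cap is applied regardless of plan, and Enterprise limits are not checked because they are custom. The helper names are hypothetical.

```python
from typing import List

MB = 1_000_000  # assumption: decimal megabytes

# Per-plan maximum file size; None means "custom", so no check is made.
PLAN_LIMIT_BYTES = {"starter": 20 * MB, "pro": 100 * MB, "enterprise": None}
VIDEO_PROCESSING_CAP_BYTES = 20 * MB  # assumed to apply on every plan
COST_PER_VIDEO_FRAME_USD = 0.00079    # approximate figure from the cost note


def check_upload(plan: str, modality: str, size_bytes: int) -> List[str]:
    """Return a list of problems; an empty list means the file passes."""
    problems = []
    limit = PLAN_LIMIT_BYTES[plan.lower()]
    if limit is not None and size_bytes > limit:
        problems.append(f"file exceeds the {plan} plan limit of {limit // MB} MB")
    if modality == "video" and size_bytes > VIDEO_PROCESSING_CAP_BYTES:
        problems.append("video exceeds the 20 MB processing cap")
    return problems


def estimate_video_embedding_cost(frame_count: int) -> float:
    """Rough embedding cost in USD for a given number of processed frames."""
    return frame_count * COST_PER_VIDEO_FRAME_USD
```

Under these assumptions, 1,000 processed frames come to roughly $0.79. The number of frames sampled per second of video is not stated here, so the frame count has to be supplied by the caller.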

Privacy & Encryption

All media follows ExoVault's zero-knowledge model:

Files are encrypted client-side before upload (dashboard) or server-side via wrapped MEK (agent uploads)
Gemini processes files for extraction, then the extracted text is encrypted
The server stores only ciphertext — neither the original media nor the extracted text is accessible in plaintext
Embeddings are stored alongside encrypted content for search

Dashboard

The Media page in the dashboard shows all uploaded media across your vaults:

Preview thumbnails for images
Playback for audio/video
Extraction status indicators
Link to the associated memory

Agent Usage

Agents interact with media through standard MCP tools — no special media tools needed:

search_memories — finds content extracted from media alongside text memories
write_memory — memories can reference media attachments
read_memories — returns attachment metadata including extracted text
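For reference, an MCP client invokes a tool by sending a JSON-RPC 2.0 request with the `tools/call` method. The sketch below builds such a request for `search_memories`. The `tools/call` envelope follows the Model Context Protocol, but the `query` argument name is an assumption, since this page does not document the tool's input schema.

```python
import json
from itertools import count

_request_ids = count(1)  # simple incrementing JSON-RPC request ids


def build_tool_call(tool_name: str, arguments: dict) -> str:
    """Serialize an MCP tools/call request as a JSON string."""
    request = {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    return json.dumps(request)


# "query" is a hypothetical argument name used for illustration.
payload = build_tool_call("search_memories", {"query": "rate limit decision"})
```

The same envelope would carry `write_memory` and `read_memories` calls, which is why no separate media tooling is required on the agent side.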

Embedding Model — Gemini multimodal embedding details
Search Strategies — 4-signal hybrid search pipeline
Encryption Model — How media encryption works
Limits and Quotas — File size and storage limits
