Media & Multimodal Memory

ExoVault supports multimodal memory: you can upload video, audio, images, and PDFs alongside text memories. Content is automatically extracted, encrypted, and made searchable.

Supported Formats

| Type | Formats | Extraction |
| --- | --- | --- |
| Video | MP4, WebM | Full audio transcription + visual descriptions |
| Audio | MP3, WAV, OGG | Full speech transcription |
| Images | PNG, JPG, WebP | Visual description + text recognition (OCR) |
| PDF | PDF | Full text extraction |
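
As a rough illustration of the format list above, a client could check a file's extension before uploading. This is a minimal sketch, not ExoVault's own validation logic: the function name `modality_for` and the modality labels are assumptions, and only the extensions listed above are included.

```python
from pathlib import Path
from typing import Optional

# Extension -> modality, taken from the supported formats listed above.
# The modality labels ("video", "audio", ...) are assumed names.
MODALITY_BY_EXTENSION = {
    ".mp4": "video", ".webm": "video",
    ".mp3": "audio", ".wav": "audio", ".ogg": "audio",
    ".png": "image", ".jpg": "image", ".webp": "image",
    ".pdf": "pdf",
}

def modality_for(file_name: str) -> Optional[str]:
    """Return the modality for a file name, or None if the format is not listed."""
    return MODALITY_BY_EXTENSION.get(Path(file_name).suffix.lower())
```

A caller might reject the upload locally when `modality_for` returns `None` rather than waiting for extraction to fail.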

How It Works

1. Upload

Upload media files through the dashboard (Media page) or via the API. Files are encrypted with AES-256-GCM before storage in Supabase Storage.

2. Extraction

After upload, an Inngest background job sends the encrypted file to Gemini 2.5 Flash for content extraction:

  • Video: Gemini transcribes all spoken words verbatim with speaker labels and describes visual content with timestamps
  • Audio: Full speech-to-text transcription
  • Images: Describes what's in the image, reads any text/labels
  • PDF: Extracts all text content

The extracted text is encrypted and stored alongside the original file.

3. Embedding

The extracted text is embedded using gemini-embedding-2-preview (3,072 dimensions) — the same multimodal model used for all ExoVault embeddings. This places media content in the same vector space as text memories, enabling cross-modal search.

Agents search media content using the same search_memories tool they use for everything else:

> search_memories("rate limit decision")

✓ Found: "The API rate limit should be 100 requests per second"
  from product-review-Q1.mp4 · similarity 0.94

No media-specific search tools are needed. If it was said in a video, ExoVault finds it.
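
To make the shared-vector-space idea concrete, here is a toy ranking sketch. It assumes cosine similarity as the scoring metric (the documentation shows a similarity score but does not name the metric) and uses 3-dimensional vectors in place of real 3,072-dimensional embeddings. The labels and vectors are made up for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def rank(query_vec, items):
    """items: list of (label, vector). Returns (label, score) pairs, best match first."""
    scored = [(label, cosine_similarity(query_vec, vec)) for label, vec in items]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Text memories and media transcripts live in one list because they share one vector space.
items = [
    ("text memory: lunch order", [0.0, 1.0, 0.0]),
    ("product-review-Q1.mp4 transcript", [0.9, 0.1, 0.0]),
]
query = [1.0, 0.0, 0.0]  # stand-in for the embedding of "rate limit decision"
results = rank(query, items)
```

Because every item is scored the same way regardless of where its text came from, no separate media index is needed in this sketch.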

Media Attachments

Media files are attached to memories. When a memory has attachments:

  • The attachments field lists all associated media files
  • Each attachment includes: modality, mimeType, fileName, fileSizeBytes, embeddingStatus, extractionStatus, extractedText
  • Extracted text is included in search results so agents see what was said/shown
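
The attachment fields listed above could be modeled on the client side roughly as follows. The field names come from the list; the Python types, the optional `extractedText`, and the sample values are assumptions, since the documentation does not specify them.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Attachment:
    # Field names mirror the attachment fields listed above; types are assumed.
    modality: str
    mimeType: str
    fileName: str
    fileSizeBytes: int
    embeddingStatus: str
    extractionStatus: str
    extractedText: Optional[str] = None  # assumed to be absent until extraction finishes

# Hypothetical payload shaped like one entry of a memory's attachments field.
raw = {
    "modality": "video",
    "mimeType": "video/mp4",
    "fileName": "product-review-Q1.mp4",
    "fileSizeBytes": 18_000_000,
    "embeddingStatus": "ready",
    "extractionStatus": "ready",
    "extractedText": "The API rate limit should be 100 requests per second",
}
attachment = Attachment(**raw)
```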

Embedding Status

| Status | Meaning |
| --- | --- |
| pending | Upload complete, waiting for embedding |
| processing | Gemini is extracting and embedding |
| ready | Extraction and embedding complete; searchable |
| failed | Extraction failed (check file format/size) |
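
A client that polls for completion might treat the four status values above as a small state set. The sketch below is illustrative only: the enum and helper names are assumptions, and it takes `ready` and `failed` to be the final states, which the descriptions above suggest but do not state outright.

```python
from enum import Enum

class EmbeddingStatus(str, Enum):
    PENDING = "pending"
    PROCESSING = "processing"
    READY = "ready"
    FAILED = "failed"

def is_terminal(status: EmbeddingStatus) -> bool:
    """True once no further processing is expected (assumed: ready or failed)."""
    return status in (EmbeddingStatus.READY, EmbeddingStatus.FAILED)

def is_searchable(status: EmbeddingStatus) -> bool:
    """Only media in the ready state is described as searchable."""
    return status is EmbeddingStatus.READY
```

A polling loop could stop as soon as `is_terminal(EmbeddingStatus(value))` is true, where `value` is the status string read from the attachment.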

File Size Limits

| Plan | Max File Size |
| --- | --- |
| Starter | 20 MB |
| Pro | 100 MB |
| Enterprise | Custom |
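
A pre-upload size check against these plan limits could look like the sketch below. It assumes 1 MB means 1,048,576 bytes (the documentation does not say whether limits are binary or decimal), and the function name and `custom_limit_bytes` parameter are hypothetical.

```python
MB = 1024 * 1024  # assumption: binary megabytes

# Limits from the plan list above; Enterprise is custom, so it has no fixed entry.
MAX_FILE_SIZE_BYTES = {"starter": 20 * MB, "pro": 100 * MB}

def within_limit(plan: str, file_size_bytes: int, custom_limit_bytes=None) -> bool:
    """Check a file size against the plan's limit; custom plans must supply their own limit."""
    limit = MAX_FILE_SIZE_BYTES.get(plan.lower(), custom_limit_bytes)
    if limit is None:
        raise ValueError(f"no known file size limit for plan {plan!r}")
    return file_size_bytes <= limit
```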

Cost note: Video embedding is more expensive than text (~$0.00079/frame). ExoVault caps video processing at 20 MB to manage costs. For longer videos, consider uploading the audio track separately.
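
For a back-of-the-envelope estimate based on the per-frame figure above, the arithmetic is simply frames multiplied by cost per frame. The sampling rate is an assumption here: the documentation does not say how many frames per second are processed, so the sketch takes it as a parameter and defaults to 1 frame per second purely for illustration.

```python
COST_PER_FRAME_USD = 0.00079  # approximate per-frame figure from the cost note

def estimate_video_cost(duration_seconds: float, frames_per_second: float = 1.0) -> float:
    """Estimate embedding cost in USD; frames_per_second is an assumed sampling rate."""
    frames = duration_seconds * frames_per_second
    return frames * COST_PER_FRAME_USD

ten_minute_estimate = estimate_video_cost(600)  # 600 frames at the assumed 1 frame/second
```

Under that assumed rate, a 10-minute video works out to 600 frames, or roughly $0.47.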

Privacy & Encryption

All media follows ExoVault's zero-knowledge model:

  1. Files are encrypted client-side before upload (dashboard) or server-side via wrapped MEK (agent uploads)
  2. Gemini processes files for extraction, then the extracted text is encrypted
  3. The server stores only ciphertext — neither the original media nor the extracted text is accessible in plaintext
  4. Embeddings are stored alongside encrypted content for search

Dashboard

The Media page in the dashboard shows all uploaded media across your vaults:

  • Preview thumbnails for images
  • Playback for audio/video
  • Extraction status indicators
  • Link to the associated memory

Agent Usage

Agents interact with media through standard MCP tools — no special media tools needed:

  • search_memories — finds content extracted from media alongside text memories
  • write_memory — memories can reference media attachments
  • read_memories — returns attachment metadata including extracted text
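
For readers wiring up an agent by hand, the sketch below builds the JSON-RPC 2.0 request that MCP uses for tool invocation (`tools/call`). The tool name comes from the list above; the argument key `query` is an assumption, since the tool's input schema is not shown here, and `build_tool_call` is a hypothetical helper.

```python
import json

def build_tool_call(request_id: int, tool_name: str, arguments: dict) -> dict:
    """Build an MCP tools/call request as a JSON-RPC 2.0 message."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }

# "query" is an assumed argument name for search_memories.
request = build_tool_call(1, "search_memories", {"query": "rate limit decision"})
print(json.dumps(request, indent=2))
```

The same request shape would apply to write_memory and read_memories, with only the tool name and arguments changing.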