Overview

Jumper uses three types of machine learning models to help you search through your footage:

Visual search models

AI systems that can “see” and understand images and video frames, then connect that understanding to text queries or other images.

Speech models

AI systems that can transcribe spoken words into searchable text.

Face detection models

AI systems that can identify and group similar faces together, allowing you to search for specific people.

All models run entirely on your local machine, ensuring your footage never leaves your computer.

Visual Search Models

Visual search models are AI systems that can “see” and understand images and video frames, then connect that understanding to text queries or other images. This is what makes Jumper’s visual search work.

How visual search models work:
These models learn correlations between text and visual data from their training material. When you search for something like “a person walking through a door,” the model has learned to understand:
  • What “a person” looks like visually
  • What “walking” means in terms of motion and pose
  • What “a door” is and how it appears in different contexts
  • How these elements relate to each other in a scene
When you perform a text search, the model identifies the frames whose visual content best matches your query. Thanks to efficient algorithms, this works even across thousands of hours of video. The model doesn’t just match keywords: it understands the semantic meaning of your query and finds frames that match that meaning, even if the exact words never appear in the footage. A simplified sketch of this matching follows the list below.

Search methods:
  • Text search: Enter a natural language query to find matching visual content
  • Image search: Use an image or frame as your search input instead of text. The model uses the same underlying approach to find visually similar content across your footage
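Both search methods come down to the same mechanism: the model maps text and frames into a shared embedding space, then ranks frames by how close their embeddings sit to the query embedding. The sketch below illustrates the idea with the open-source CLIP model via the sentence-transformers library; Jumper’s actual models and indexing differ, and the file names are made up.

```python
from PIL import Image
from sentence_transformers import SentenceTransformer, util

# One model maps both text and images into the same vector space
model = SentenceTransformer("clip-ViT-B-32")

# During analysis: embed sampled frames (file names here are hypothetical)
frame_paths = ["frame_0001.jpg", "frame_0002.jpg", "frame_0003.jpg"]
frame_embeddings = model.encode([Image.open(p) for p in frame_paths])

# Text search: embed the query, then rank frames by cosine similarity
query_embedding = model.encode("a person walking through a door")
scores = util.cos_sim(query_embedding, frame_embeddings)[0].tolist()

# Image search works the same way: embed a reference frame instead of text
ref_embedding = model.encode(Image.open("frame_0002.jpg"))
image_scores = util.cos_sim(ref_embedding, frame_embeddings)[0].tolist()

for path, score in sorted(zip(frame_paths, scores), key=lambda x: -x[1]):
    print(f"{path}: {score:.3f}")
```

Because the frame embeddings can be computed once during analysis and stored, each new search only needs to embed the query, which is what keeps search fast across large libraries.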
What this means for you:
  • You can search using natural, conversational language
  • You can search using images or frames from your footage
  • The model understands context and relationships between objects
  • It works across different languages (multilingual models)
  • Higher resolution models can detect finer details like text on signs or small objects
Model characteristics:
  • Size: Larger models generally offer better accuracy but require more memory and processing power
  • Resolution: Higher resolution models (384×384, 512×512) analyze frames in more detail, useful for detecting text, small objects, or fine visual details
  • Speed: Smaller models analyze faster, while larger models take more time but provide better results
For a complete list of available models and their specifications, see the List of Machine Learning Models.

Speech Models

For speech search, Jumper uses Whisper models developed by OpenAI. These models transcribe spoken words in your footage, making dialogue searchable.

How speech models work:
The Whisper model listens to the audio track of your media files and converts speech into text. This transcription happens during the analysis phase, so once your media is analyzed, you can search for any word or phrase that was spoken. A minimal example follows the list below.

Key features:
  • Supports 111 languages automatically
  • Handles different accents, background noise, and audio quality
  • Transcribes dialogue with timestamps
  • Works entirely offline with no cloud processing required
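As a rough illustration of what transcription produces, here is the open-source openai-whisper package generating timestamped segments; Jumper ships its own optimized variants, and the file name is hypothetical.

```python
import whisper

# Smaller checkpoints analyze faster; larger ones are more accurate
model = whisper.load_model("base")

# Whisper extracts the audio track and transcribes it with timestamps
result = model.transcribe("interview.mov")

# The start/end times are what make every spoken phrase searchable
for segment in result["segments"]:
    print(f'[{segment["start"]:.1f}s - {segment["end"]:.1f}s]{segment["text"]}')
```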
Platform differences:
The exact Whisper model variant depends on your platform. Apple M-series Macs use an optimized version that takes advantage of Apple’s hardware acceleration, resulting in faster analysis and a smaller model size.

Face Detection Models

Face detection requires a Jumper Pro license

Face detection uses specialized AI models to identify and recognize people in your footage. These models analyze faces frame by frame, extract facial features, and group similar faces together; a simplified sketch of the grouping step appears after the list below.

How face detection models work:
The face detection algorithm processes each frame of your media to:
  • Detect faces in the frame
  • Extract facial features from each detected face
  • Group similar faces together, even when lighting, angles, or expressions change
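To make the grouping step concrete, here is a minimal sketch that clusters faces by the similarity of their embedding vectors, assuming a face recognition model has already turned each detected face into an embedding (that step is not shown). Jumper’s actual pipeline is more robust; the threshold here is illustrative.

```python
import numpy as np

def group_faces(embeddings: np.ndarray, threshold: float = 0.7) -> list[list[int]]:
    """Greedily group face indices whose embeddings are similar enough."""
    # L2-normalize so that a dot product equals cosine similarity
    embs = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    clusters: list[list[int]] = []    # each cluster is a list of face indices
    exemplars: list[np.ndarray] = []  # one representative embedding per cluster
    for i, emb in enumerate(embs):
        sims = [float(ex @ emb) for ex in exemplars]
        if sims and max(sims) >= threshold:
            clusters[sims.index(max(sims))].append(i)  # likely the same person
        else:
            clusters.append([i])                       # a new person
            exemplars.append(emb)
    return clusters

# e.g. 10 detected faces with 512-dimensional embeddings -> groups of indices
print(group_faces(np.random.randn(10, 512)))
```

Because the embeddings capture identity rather than raw pixels, the same person tends to land in the same group even as lighting, angle, or expression changes.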
What this means for you:
  • Automatically identify all appearances of a person across your entire footage library
  • Search for specific people using the @ syntax (e.g., @John sitting on a bench); a hypothetical sketch of such a query follows this list
  • Organize people into Collections to keep different productions separate
  • Find every scene where someone appears, even if they’re in the background
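As a purely hypothetical illustration of how an @ query could combine the two model types, the sketch below splits the query into a person filter and a visual search query; none of these helper names come from Jumper.

```python
# parse_person_query is a hypothetical helper, not part of Jumper's API
def parse_person_query(query: str) -> tuple[str | None, str]:
    """Split '@John sitting on a bench' into ('John', 'sitting on a bench')."""
    if query.startswith("@"):
        person, _, rest = query[1:].partition(" ")
        return person, rest.strip()
    return None, query

person, text_query = parse_person_query("@John sitting on a bench")
# 1. person     -> restrict results to frames where that face group appears
# 2. text_query -> rank the remaining frames with the visual search model
print(person, "|", text_query)  # John | sitting on a bench
```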
For more information about using face detection, see Face detection.

Choosing the Right Model

Depending on your specific workflow, use case, and hardware setup, you may want to choose a model other than the default. Some suggestions are listed below, or feel free to explore all the available models.

Fastest

Quick analysis with good accuracy, 256×256 resolution. Great for most workflows.

Fast

Larger model with higher accuracy, 256×256 resolution.

Accurate

Top-tier accuracy with 384×384 resolution. Benefits from more detailed and verbose searches; great multilingual support.

Most Accurate

Highest accuracy with 512×512 resolution. Benefits from more detailed and verbose searches; great multilingual support.

For a complete list of all available models with detailed specifications, see the List of Machine Learning Models.