Overview
Jumper uses three types of machine learning models to help you search through your footage:
Visual search models
AI systems that can “see” and understand images and video frames, then connect that understanding to text queries or other images.
Speech models
AI systems that can transcribe spoken words into searchable text.
Face detection models
AI systems that can identify and group similar faces together, allowing you to search for specific people.
Visual Search Models
Visual search models are AI systems that can “see” and understand images and video frames, then connect that understanding to text queries or other images. This is what makes Jumper’s visual search work.
How visual search models work: These models learn to find correlations between text and visual data from their training material (a simplified sketch of this matching follows the list below). When you search for something like “a person walking through a door,” the model has learned to understand:
- What “a person” looks like visually
- What “walking” means in terms of motion and pose
- What “a door” is and how it appears in different contexts
- How these elements relate to each other in a scene
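To make the idea concrete, here is a minimal sketch of text-to-frame matching using an open-source CLIP-style model from the Hugging Face transformers library. The model name and frame file names are illustrative assumptions; this is not Jumper’s own code or the specific model it ships with.

```python
# Illustrative only: an open-source CLIP-style model, not Jumper's bundled model.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

frames = [Image.open("frame_001.jpg"), Image.open("frame_002.jpg")]  # hypothetical frames
query = "a person walking through a door"

inputs = processor(text=[query], images=frames, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the similarity of each frame to the query;
# the highest-scoring frame is the best visual match.
scores = outputs.logits_per_image.squeeze(1)
best = int(scores.argmax())
print(f"Best match: frame {best} (score {scores[best].item():.2f})")
```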
You can search in two ways:
- Text search: Enter a natural language query to find matching visual content
- Image search: Use an image or frame as your search input instead of text. The model uses similar algorithms to find visually similar content across your footage (see the image-to-image sketch below)
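Image search can be pictured the same way, except both the query and the footage are embedded as images and compared directly. Again, the model name, file names, and helper function are illustrative assumptions, not Jumper’s actual implementation.

```python
# Illustrative sketch of image-to-image similarity with the same open-source model.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_images(images):
    """Return unit-length embeddings for a list of PIL images."""
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

query_vec = embed_images([Image.open("reference_frame.jpg")])  # hypothetical query image
frame_vecs = embed_images([Image.open(f"frame_{i:03d}.jpg") for i in range(1, 4)])  # hypothetical frames

# Cosine similarity of every analyzed frame to the query frame;
# higher scores mean visually more similar content.
similarity = (frame_vecs @ query_vec.T).squeeze(1)
print(similarity.tolist())
```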
What this means for you:
- You can search using natural, conversational language
- You can search using images or frames from your footage
- The model understands context and relationships between objects
- It works across different languages (multilingual models)
- Higher resolution models can detect finer details like text on signs or small objects
Key differences between models:
- Size: Larger models generally offer better accuracy but require more memory and processing power
- Resolution: Higher resolution models (384×384, 512×512) analyze frames in more detail, useful for detecting text, small objects, or fine visual details
- Speed: Smaller models analyze faster, while larger models take more time but provide better results
Speech Models
For speech search, Jumper uses Whisper models developed by OpenAI. These models transcribe spoken words in your footage, making dialogue searchable.
How speech models work: The Whisper model listens to the audio track of your media files and converts speech into text. This transcription happens during the analysis phase, so once your media is analyzed, you can search for any word or phrase that was spoken (a simplified transcription sketch follows the list below).
Key features:
- Supports 111 languages automatically
- Handles different accents, background noise, and audio quality
- Transcribes dialogue with timestamps
- Works entirely offline with no cloud processing required
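As a rough illustration, this is what a transcription pass looks like with the open-source openai-whisper Python package. The model size and file name are placeholders, and Jumper’s own integration may differ.

```python
# Illustrative only: the open-source `openai-whisper` package, not Jumper's integration.
import whisper

model = whisper.load_model("base")          # smaller models are faster, larger ones more accurate
result = model.transcribe("interview.mov")  # hypothetical clip; language is detected automatically

for segment in result["segments"]:
    # Each segment carries start/end times (in seconds) plus the transcribed text,
    # which is what makes spoken dialogue searchable with timestamps.
    print(f"[{segment['start']:7.2f} - {segment['end']:7.2f}] {segment['text'].strip()}")
```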
Face Detection Models
Face detection requires a Jumper Pro license.
Face detection models are AI systems that can:
- Detect faces in the frame
- Group similar faces together, even when lighting, angles, or expressions change (a simplified grouping sketch follows the list below)
- Automatically identify all appearances of a person across your entire footage library
This allows you to:
- Search for specific people using the @ syntax (e.g., @John sitting on a bench)
- Organize people into Collections to keep different productions separate
- Find every scene where someone appears, even if they’re in the background
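The grouping step can be pictured as clustering face embeddings by similarity. The sketch below uses synthetic embeddings, a made-up similarity threshold, and a deliberately simple greedy strategy; Jumper’s actual face models and grouping logic are not exposed here.

```python
# Illustrative only: greedy grouping of face embeddings by cosine similarity.
import numpy as np

def group_faces(embeddings, threshold=0.6):
    """Assign each face embedding to the first group whose representative it is
    similar enough to; otherwise start a new group (i.e. a new person)."""
    groups = []  # list of (representative_embedding, [face indices])
    for i, emb in enumerate(embeddings):
        emb = emb / np.linalg.norm(emb)
        for rep, members in groups:
            if float(np.dot(emb, rep)) >= threshold:
                members.append(i)
                break
        else:
            groups.append((emb, [i]))
    return [members for _, members in groups]

# Four synthetic 128-dimensional "face embeddings": two near-duplicate pairs.
rng = np.random.default_rng(0)
a, b = rng.normal(size=128), rng.normal(size=128)
faces = [a, a + 0.05 * rng.normal(size=128), b, b + 0.05 * rng.normal(size=128)]
print(group_faces(faces))  # expected grouping: [[0, 1], [2, 3]]
```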
Choosing the Right Model
Depending on your specific workflow, use case, and hardware setup, you might want to choose a specific model other than the default. Some suggestions are listed below, or feel free to explore all the available models.
Fastest
Quick analysis with good accuracy, 256×256 resolution. Great for most workflows.
Fast
Larger model with higher accuracy, 256×256 resolution.
Accurate
Top-tier accuracy with 384×384 resolution.
Benefits from more detailed and verbose searches, great multilingual support.
Most Accurate
Highest accuracy with 512×512 resolution.
Benefits from more detailed and verbose searches, great multilingual support.

