# Visual Search Models
Jumper uses powerful AI models to help you search through your videos visually. These models understand what's in your footage and let you find moments using natural language.
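At a high level, natural-language visual search works by embedding both frames and text queries into a shared vector space and ranking frames by similarity. The sketch below is illustrative only: it uses made-up 3-dimensional vectors in place of real model embeddings and does not use Jumper's actual API.

```python
# Toy sketch of embedding-based visual search: rank frames by cosine
# similarity between a text-query embedding and per-frame embeddings.
# The vectors below are hand-made stand-ins; real models use hundreds of dims.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

frame_embeddings = {
    "frame_001": [0.9, 0.1, 0.0],
    "frame_002": [0.1, 0.8, 0.3],
    "frame_003": [0.2, 0.2, 0.9],
}
query = [0.85, 0.15, 0.05]  # e.g. the embedding of "a red car at night"

ranked = sorted(frame_embeddings,
                key=lambda f: cosine(frame_embeddings[f], query),
                reverse=True)
print(ranked[0])  # frame_001 is the closest match
```

Because the embeddings live in one shared space, the same ranking math works regardless of which language the query was written in.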
## Jumper's default model
V2 Medium is our default choice and works great for most users. It offers a good balance of accuracy and performance.
A MacBook Pro (M1 Pro, 16GB RAM) comfortably handles every model up to V2 Medium, as well as V2 Large at 256×256.
Higher-resolution models (384×384 or 512×512) require more RAM/VRAM. In our tests, 16GB of unified memory maxes out at roughly the same point as an Nvidia GPU with 6GB of VRAM.
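Some back-of-envelope arithmetic shows why resolution drives memory use: raw input size grows with the square of the frame resolution. Real memory use also includes model weights and transformer activations (which grow with patch count), so these numbers are a lower bound, not a prediction of actual requirements.

```python
# Per-frame input size at different model resolutions, assuming RGB frames
# stored as float32 (4 bytes per value). Illustrative arithmetic only.
def frame_bytes(res: int, channels: int = 3, bytes_per_value: int = 4) -> int:
    """Bytes for one RGB float32 frame at res x res."""
    return res * res * channels * bytes_per_value

for res in (256, 384, 512):
    mib = frame_bytes(res) / 2**20
    print(f"{res}x{res}: {mib:.2f} MiB per frame "
          f"({(res / 256) ** 2:.2f}x the 256 baseline)")
```

A 512×512 frame carries 4× the pixels of a 256×256 frame, which is why the jump to high-resolution models is felt so sharply in RAM/VRAM.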
⚠️ Warning: Changing models requires re-analyzing your footage!
However, if you have previously analyzed a file with a model of the same resolution as the new one, processing resumes from 50% done (frame extraction is already complete).
Models are downloaded to the `.jumper/models` directory in your home directory. For example, that would be `/Users/<YOUR_USERNAME>/.jumper/models` on macOS and `C:/Users/<YOUR_USERNAME>/.jumper/models` on Windows (replace `<YOUR_USERNAME>` with your username, e.g. `JohnSmith`).
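If you want to find this directory programmatically, a small sketch using only the Python standard library (based on the path layout above) would be:

```python
# Print the Jumper models directory for the current user. Path.home()
# resolves the home directory on macOS, Windows, and Linux alike.
from pathlib import Path

models_dir = Path.home() / ".jumper" / "models"
print(models_dir)           # e.g. /Users/JohnSmith/.jumper/models on macOS
print(models_dir.exists())  # True once at least one model has been downloaded
```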
## 🎉 New model "V2 Multilingual x-high-res" added 2025-03-30! 🎉
This model achieves new state-of-the-art search accuracy on multilingual benchmarks, improving accuracy by up to 50% compared to the default model.
It is also excellent for searching in English. However, it only comes in a high-resolution version, which means longer analysis times and higher demands on computer resources; analysis will be significantly faster on computers with Nvidia GPUs.
We highly recommend downloading the latest version and trying it out!
# What's New and Improved in the V2 Models?
## ⚡ TL;DR – Summary
V2 models deliver better accuracy 🔍, multilingual support 🌍, and improved object localization 🎯. They're also more fair & diverse 🏳️🌈, reducing biases in search results.
V1 models remain available for compatibility, but V2 is recommended for most users.
## 🔍 Better Accuracy
The V2 models perform better on all tasks relevant to Jumper compared to V1 models.
## 🌍 Multilingual Support
All V2 models handle many languages, making accurate search possible well beyond English.
This also improves results when searching for frames containing non-English text or concepts.
## 📍 Better Localization
The V2 models are better at pinpointing objects and regions within your footage.
This leads to more accurate search results, especially when videos contain multiple elements or small details.
## 🏳️🌈 Cultural Diversity & Fairness
The V2 models are trained to represent a wider range of cultural contexts and reduce biased associations.
For example, V2 models are far less likely to mislabel neutral objects (like cars) as predominantly "men's" or "women's" items.
They also perform better at identifying diverse images across different cultures and geographic regions.
The V1 models are still included for backward compatibility, but we recommend V2 models for nearly all use cases. (Note: Certain V1 multilingual models may still slightly outperform V2 in a few specific languages such as Hindi or Thai.)
# Model List
Below is a quick-reference table for both V2 models and V1 models we currently offer. Each model's naming corresponds to how you'd see it inside Jumper.
One final note: high-resolution models are not necessarily slower during normal queries (e.g., searching your pre-analyzed footage), but the initial analysis phase demands more hardware resources and creates larger analysis files. For instance, `Medium x-high-res` may generate bigger analysis files than `V2 XLarge`, simply because the former uses 512×512 frames.
# Transcription Models
For transcriptions, Jumper uses Whisper models developed by OpenAI, specifically the `whisper-large-v3-turbo` variant. The choice of model depends on your platform:
Note: The quantized M-series model keeps the same number of parameters but stores them at lower numerical precision, reducing disk size and leveraging Apple's hardware optimizations.
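As a rough illustration of why lower precision shrinks the download: the parameter count (~809M for `whisper-large-v3-turbo`) comes from OpenAI's published figures, not Jumper's docs, and the byte widths below are generic examples rather than the exact format Jumper ships.

```python
# Same parameter count, fewer bytes per parameter -> smaller file on disk.
# int8 is shown only as a common lower-precision width; the actual
# quantization format used by Jumper is not specified here.
PARAMS = 809_000_000  # whisper-large-v3-turbo, approx. (per OpenAI)

for name, bytes_per_param in [("float32", 4), ("float16", 2), ("int8 (example)", 1)]:
    gb = PARAMS * bytes_per_param / 1e9
    print(f"{name}: ~{gb:.2f} GB of weights")
```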
# References
All search models in Jumper are based on the following research:
Sigmoid Loss for Language Image Pre-Training (Zhai, Mustafa, Kolesnikov, Beyer, 2023) https://arxiv.org/abs/2303.15343
SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features (Tschannen et al., 2025) https://arxiv.org/abs/2502.14786
Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design (Alabdulmohsin, Zhai, Kolesnikov, Beyer, 2024) https://arxiv.org/abs/2305.13035
Learning Transferable Visual Models From Natural Language Supervision (Radford et al., 2021) https://arxiv.org/abs/2103.00020
Robust Speech Recognition via Large-Scale Weak Supervision (Radford, Kim, Xu, Brockman, McLeavey, Sutskever, 2022) https://arxiv.org/abs/2212.04356
MEXMA: Token-level objectives improve sentence representations (Janeiro, Piwowarski, Gallinari, Barrault, 2024) https://arxiv.org/abs/2409.12737