# Visual Search Models

All visual search models in Jumper are based on this paper by researchers at Google DeepMind:
Sigmoid Loss for Language Image Pre-Training (Zhai, Mustafa, Kolesnikov, Beyer, 2023) [1].
Additionally, the XLarge and XLarge multilingual models incorporate techniques introduced by researchers at Google DeepMind in this paper:
Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design (Alabdulmohsin, Zhai, Kolesnikov, Beyer, 2024) [2].

Honorable mention to this seminal work by researchers at OpenAI, which lays the foundation that the above two papers build upon:
Learning Transferable Visual Models From Natural Language Supervision (Radford et al., 2021) [3].

Recommendation: For most use cases, we suggest using at least the Large variant. A MacBook Pro (M1 Pro, 16GB RAM) comfortably handles every model except XLarge and Medium x-high-res.
That limit comes down to my MBP "only" having 16GB of unified memory: in my tests, 16GB of unified memory maxes out at roughly the same point as an Nvidia GPU with 6GB of VRAM.

Changing models will require you to re-analyse your footage!
However, if you have previously analysed a file with a model of the same resolution as the new one, processing will start at 50% done, since the frame-extraction step is already complete.

Models are downloaded to the .jumper/models directory in your HOME directory.
For me (Max), that would be /Users/max/.jumper/models on macOS and C:/Users/Max/.jumper/models on Windows.
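
If you ever need to inspect or clear out the downloaded models, the directory can be resolved the same way on any platform. Below is a minimal sketch (not part of Jumper itself; it only assumes the ~/.jumper/models location described above) that lists whatever is currently in that directory:

```python
from pathlib import Path

# Resolve ~/.jumper/models on macOS, Windows, and Linux alike.
# Path.home() expands to /Users/<name> on macOS and C:/Users/<name> on Windows.
models_dir = Path.home() / ".jumper" / "models"

if not models_dir.exists():
    print(f"No models downloaded yet: {models_dir} does not exist.")
else:
    for entry in sorted(models_dir.iterdir()):
        if entry.is_file():
            print(f"{entry.name}: {entry.stat().st_size / 1e9:.2f} GB")
        else:
            print(f"{entry.name}/ (directory)")
```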

| Model | Resolution | Parameters | Size on Disk | Notes |
|---|---|---|---|---|
| Medium | 256×256 | ~203M | ~812MB | Base model bundled with the application. |
| Medium x-high-res | 512×512 | ~203M | ~812MB | Same parameters/size as Medium, but higher resolution than any other model. Improves certain tasks (e.g. reverse image searches, text detection/OCR) due to increased frame detail. |
| Large multilingual | 256×256 | ~370M | ~1.48GB | Multilingual version that dramatically increases accuracy for non-English searches. |
| Large | 256×256 | ~652M | ~2.61GB | Larger version of the Medium model. |
| Large high-res | 384×384 | ~652M | ~2.61GB | Same as Large, but using higher-resolution frames. |
| XLarge | 384×384 | ~877M | ~3.51GB | Even larger model; uses techniques from paper [2]. |
| XLarge multilingual | 256×256 | ~1.13B | ~4.51GB | Multilingual version using paper [2], providing the highest accuracy for non-English searches. |
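
As a side note, the "Size on Disk" figures track the parameter counts closely: each is roughly parameters × 4 bytes, which suggests the weights are stored as 32-bit floats. A quick sanity check on the table's numbers (the 4-bytes-per-parameter figure is an inference, not something stated here):

```python
# Rough check: size on disk ≈ parameters × 4 bytes (32-bit floats).
# Assumption: the exact serialization format is not specified in this document.
models = {
    "Medium":              203e6,
    "Large multilingual":  370e6,
    "Large":               652e6,
    "XLarge":              877e6,
    "XLarge multilingual": 1.13e9,
}

for name, params in models.items():
    print(f"{name}: ~{params * 4 / 1e9:.2f} GB")
# Medium -> ~0.81 GB, Large multilingual -> ~1.48 GB, Large -> ~2.61 GB,
# XLarge -> ~3.51 GB, XLarge multilingual -> ~4.52 GB -- matching the table.
```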

One final note: high-resolution models are not slower to use than an otherwise identical lower-resolution variant.
They do, however, demand more RAM/VRAM during the analysis phase and produce larger analysis files than their lower-resolution counterparts. Analysis file size is driven by frame resolution rather than model size, so Medium x-high-res (512×512) produces larger analysis files than the much larger XLarge multilingual model (256×256).
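
To put rough numbers on the resolution point: a 512×512 frame holds four times the pixels of a 256×256 frame, and a 384×384 frame holds 2.25 times as many. The sketch below just computes those ratios; the assumption that analysis size scales roughly with pixels per frame follows from the note above, not from any documented file format:

```python
# Relative pixel counts per extracted frame for the resolutions in the table.
# Assumption: analysis file size grows roughly with pixels per frame,
# as implied above; the actual on-disk format is a Jumper internal.
base = 256 * 256

for r in (256, 384, 512):
    print(f"{r}x{r}: {r * r / base:.2f}x the pixels of 256x256")
# 256x256: 1.00x, 384x384: 2.25x, 512x512: 4.00x
```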


# Transcription Models

For transcriptions, we use Whisper models developed from research at OpenAI:
Robust Speech Recognition via Large-Scale Weak Supervision (Radford, Kim, Xu, Brockman, McLeavey, Sutskever, 2022) [4].

Specifically, the whisper-large-v3-turbo variant is used; the exact format of the model depends on the platform:

| Platform | Model Variant | Parameters | Size on Disk | Notes |
|---|---|---|---|---|
| Windows / Intel Mac | whisper-large-v3-turbo | ~809M | ~1.62GB | Bundled with the Windows and Intel Mac installers. |
| Apple M-series Macs | whisper-large-v3-turbo | ~809M | ~467MB | Uses a quantized version converted to Apple's MLX framework for hardware acceleration (faster analysis, smaller model size). Bundled with the Apple M-series installer. |

Note: The quantized M-series model keeps the same number of parameters but stores each one with fewer bits (lower numerical precision), which reduces the size on disk and takes advantage of Apple hardware optimizations.
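
For a rough sense of what the quantization buys, the table's own numbers are enough: ~809M parameters at 16 bits each comes to about 1.62GB (matching the Windows/Intel figure, which suggests 16-bit weights there), while the ~467MB MLX file works out to roughly 4-5 bits per parameter on average. The exact MLX quantization scheme isn't specified here; the sketch below is only arithmetic on the listed sizes:

```python
# Arithmetic check on the Whisper figures in the table above.
# Assumption: only the sizes listed here are known; the precise MLX
# quantization scheme is not documented in this section.
params = 809e6

fp16_bytes = params * 2                  # 16-bit weights -> ~1.62 GB
print(f"16-bit: ~{fp16_bytes / 1e9:.2f} GB")

mlx_bytes = 467e6                        # size on disk from the table
bits_per_param = mlx_bytes * 8 / params
print(f"MLX quantized: ~{bits_per_param:.1f} bits per parameter on average")
# ~4.6 bits/param, versus 16 bits/param for the unquantized variant.
```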


# References

[1]

@misc{zhai2023sigmoidlosslanguageimage,
      title={Sigmoid Loss for Language Image Pre-Training}, 
      author={Xiaohua Zhai and Basil Mustafa and Alexander Kolesnikov and Lucas Beyer},
      year={2023},
      eprint={2303.15343},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2303.15343}, 
}

[2]

@misc{alabdulmohsin2024gettingvitshapescaling,
      title={Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design}, 
      author={Ibrahim Alabdulmohsin and Xiaohua Zhai and Alexander Kolesnikov and Lucas Beyer},
      year={2024},
      eprint={2305.13035},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2305.13035}, 
}

[3]

@misc{radford2021learningtransferablevisualmodels,
      title={Learning Transferable Visual Models From Natural Language Supervision}, 
      author={Alec Radford and Jong Wook Kim and Chris Hallacy and Aditya Ramesh and Gabriel Goh and Sandhini Agarwal and Girish Sastry and Amanda Askell and Pamela Mishkin and Jack Clark and Gretchen Krueger and Ilya Sutskever},
      year={2021},
      eprint={2103.00020},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2103.00020}, 
}

[4]

@misc{radford2022robustspeechrecognitionlargescale,
      title={Robust Speech Recognition via Large-Scale Weak Supervision}, 
      author={Alec Radford and Jong Wook Kim and Tao Xu and Greg Brockman and Christine McLeavey and Ilya Sutskever},
      year={2022},
      eprint={2212.04356},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2212.04356}, 
}