# Visual Search Models

Jumper uses powerful AI models to help you search through your videos visually. These models understand what's in your footage and let you find moments using natural language.

# Jumper's default model
V2 Medium is our default choice and works great for most users. It offers a good balance of accuracy and performance.

A MacBook Pro (M1 Pro, 16GB RAM) comfortably handles all models up to and including V2 Medium and V2 Large at 256×256.
Higher-resolution models (384×384 or 512×512) require more RAM/VRAM. In our tests, 16GB of unified memory maxes out at around the same point as an Nvidia GPU with 6GB of VRAM.

⚠️ Warning: Changing models will require you to re-analyze your footage!
However, if you have previously analyzed a file using a model with the same resolution as the new one, processing will start from 50% (frame extraction is already complete).

Models are downloaded to the .jumper/models directory in your home directory.
For example, that would be /Users/<YOUR_USERNAME>/.jumper/models on macOS and C:\Users\<YOUR_USERNAME>\.jumper\models on Windows (replace <YOUR_USERNAME> with your username, e.g. JohnSmith).
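If you want to check which models you have already downloaded without opening Jumper, a minimal sketch along these lines works on both platforms. It only assumes the directory layout described above; the folders and files inside it are managed by Jumper:

```python
from pathlib import Path

# Models live in <HOME>/.jumper/models, as described above.
models_dir = Path.home() / ".jumper" / "models"

if not models_dir.exists():
    print(f"No models downloaded yet ({models_dir} does not exist)")
else:
    for entry in sorted(models_dir.iterdir()):
        # A model may be a single file or a folder; sum the file sizes either way.
        files = [f for f in entry.rglob("*") if f.is_file()] if entry.is_dir() else [entry]
        size_gb = sum(f.stat().st_size for f in files) / 1e9
        print(f"{entry.name}: ~{size_gb:.2f} GB")
```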

# 🎉 New model "V2 Multilingual x-high-res" added 2025-03-30! 🎉

This model achieves new state-of-the-art search accuracy on multilingual benchmarks, improving accuracy by up to 50% compared to the default model.
The model is also excellent for searching in English. However, it is only available in a high-resolution (512×512) version, which means longer analysis times and higher demands on computer resources. Analysis will be significantly faster on computers with Nvidia GPUs.
We highly recommend downloading the latest version and trying it out!

# What's New and Improved in the V2 Models?

# TL;DR – Summary

V2 models deliver better accuracy 🔍, multilingual support 🌍, and improved object localization 🎯. They are also fairer and more culturally diverse 🏳️🌈, reducing biases in search results.
V1 models remain available for compatibility, but V2 is recommended for most users.

# 🔍 Better Accuracy

The V2 models perform better on all tasks relevant to Jumper compared to V1 models.

# 🌍 Multilingual Support

All V2 models handle many languages, making it possible to search accurately in languages other than English.
This also improves performance when searching for frames containing non-English text or concepts.

# 📍 Better Localization

The V2 models are better at pinpointing objects and regions within your footage.
This leads to more accurate search results, especially when videos contain multiple elements or small details.

# 🏳️🌈 Cultural Diversity & Fairness

The V2 models are trained to represent a wider range of cultural contexts and reduce biased associations.
For example, V2 models are much less likely to misattribute neutral objects (like cars) as predominantly "men's" or "women's" items.
They also perform better at recognizing imagery from a wide range of cultures and geographic regions.

The V1 models are still included for backward compatibility, but we recommend V2 models for nearly all use cases. (Note: Certain V1 multilingual models may still slightly outperform V2 in a few specific languages such as Hindi or Thai.)

# Model List

Below is a quick-reference table of the V2 and V1 models we currently offer. Each model's name matches how it appears inside Jumper.

| Model Name (Display) | Resolution | Size on Disk | Information |
|---|---|---|---|
| V2 Medium (Default) | 256×256 | ~1.5GB | Default model, bundled in the app installer. Balanced accuracy vs. performance, with improved semantic understanding from the V2 model improvements. Works well on most modern computers. |
| V2 Multilingual x-high-res | 512×512 | ~2GB | NEW, MOST ACCURATE MODEL 💪 Fantastic search result quality in both non-English and English languages. Improves accuracy on multilingual benchmarks by up to 50% compared to the default model. Highly recommended! |
| V2 Medium high-res | 384×384 | ~1.5GB | Identical parameter count to V2 Medium but processes frames at 384×384 for more image detail. Good if you need slightly finer detail than 256×256, but be aware it requires more RAM/VRAM. |
| V2 Medium x-high-res | 512×512 | ~1.5GB | Same parameter count as V2 Medium, but even higher resolution (512×512). Ideal for text detection/OCR or detailed reverse image searches. Uses significantly more memory and produces larger analysis files. |
| V2 Large | 256×256 | ~3.3GB | Larger model with higher accuracy in challenging scenarios. |
| V2 Large high-res | 384×384 | ~3.3GB | Same as V2 Large but with higher resolution (384×384). Ideal for text detection/OCR or detailed reverse image searches. |
| V2 Large x-high-res | 512×512 | ~3.3GB | Same parameters as V2 Large, but at 512×512 input resolution. Ideal for text detection/OCR or detailed reverse image searches. |
| V2 XLarge | 256×256 | ~4.5GB | Even larger V2 model. Offers improved recognition of subtle elements. Ideal if you want top-tier accuracy and have the computer to run it. |
| V2 XLarge high-res | 384×384 | ~4.5GB | A higher-resolution variant of V2 XLarge. |
| V2 XLarge x-high-res | 512×512 | ~4.5GB | The heaviest V2 model in terms of resolution and parameter requirements. Even more accurate, even more resource-demanding. |
| Medium | 256×256 | ~812MB | Legacy default model (V1), previously bundled with the application. Reasonably accurate but outperformed by V2 Medium in most scenarios. If your hardware can handle V2 Medium, prefer that instead. |
| Medium x-high-res | 512×512 | ~812MB | Same as V1 Medium but at 512×512. Useful for text detection, reverse image searches, etc. Produces larger analysis files than some bigger V1 models, purely due to the higher frame resolution. |
| Large multilingual | 256×256 | ~1.48GB | V1 multilingual version that improves accuracy for non-English text. V2 equivalents are multilingual by default. |
| Large | 256×256 | ~2.61GB | Larger V1 model. Prefer V2 alternatives. |
| Large high-res | 384×384 | ~2.61GB | Same as V1 Large but higher resolution. Demands more resources. |
| XLarge high-res | 384×384 | ~3.51GB | Largest V1 model. High accuracy, but overshadowed by the new V2 XLarge. Recommended only for legacy compatibility if you can't run V2. |
| XLarge multilingual | 256×256 | ~4.51GB | Largest V1 multilingual model. Very high accuracy for non-English searches. V2 models are multilingual by default, but V1 multilingual models can be slightly better than V2 on certain languages. |

One final note: high-resolution models are not necessarily slower during normal queries (e.g., searching your pre-analyzed footage), but the initial analysis phase demands more hardware resources and creates larger analysis files. For instance, "Medium x-high-res" may generate bigger analysis files than "V2 XLarge", simply because the former uses 512×512 frames.
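To make the resolution point concrete, here is a rough back-of-the-envelope sketch. It assumes the search models are ViT-style encoders that split each frame into 16×16 patches and that the data stored per frame scales with the patch count; Jumper's analysis file format is not documented here, so treat the numbers as intuition rather than exact sizes:

```python
# Why frame resolution, rather than model size, drives analysis-file size.
# Assumption (not specified above): ViT-style encoders with 16x16 patches, and
# per-frame data that scales with the number of patches.
PATCH = 16

def patches_per_frame(resolution: int) -> int:
    return (resolution // PATCH) ** 2

for res in (256, 384, 512):
    ratio = patches_per_frame(res) / patches_per_frame(256)
    print(f"{res}x{res}: {patches_per_frame(res)} patches per frame "
          f"({ratio:.1f}x the 256x256 baseline)")

# 512x512 frames yield 4x as many patches as 256x256 frames, which is why a smaller
# model running at 512x512 can still produce larger analysis files than a bigger
# model running at 256x256.
```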


# Transcription Models

For transcriptions, Jumper uses Whisper models developed by OpenAI, specifically the whisper-large-v3-turbo variant. The exact build depends on your platform:

| Platform | Model Variant | Size on Disk | Notes |
|---|---|---|---|
| Windows / Intel Mac | whisper-large-v3-turbo | ~1.62GB | Bundled with the Windows and Intel Mac installers. |
| Apple M-series Macs | whisper-large-v3-turbo | ~467MB | A quantized version converted to Apple's MLX framework for hardware acceleration (faster analysis, smaller model size). Bundled with the Apple M-series installer. |

Note: The quantized M-series model has the same number of parameters but stores them at lower numerical precision (fewer bits per parameter) to reduce disk size and leverage Apple's hardware optimizations.
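As a rough illustration of why the quantized build is so much smaller, the arithmetic below uses the published parameter count of whisper-large-v3-turbo (about 809 million) and assumes 16-bit weights for the regular build versus roughly 4-bit weights for the MLX build; the exact quantization scheme is not specified here:

```python
# Back-of-the-envelope size estimate for whisper-large-v3-turbo.
# Assumption: ~16-bit weights for the Windows/Intel build, ~4-bit for the MLX build.
params = 809_000_000  # approximate published parameter count

fp16_gb = params * 2 / 1e9    # 2 bytes per parameter   -> ~1.62 GB
q4_gb = params * 0.5 / 1e9    # 0.5 bytes per parameter -> ~0.40 GB

print(f"16-bit: ~{fp16_gb:.2f} GB, 4-bit: ~{q4_gb:.2f} GB")
# The ~1.62GB and ~467MB figures in the table are consistent with these estimates;
# the quantized file is slightly larger than the naive 4-bit number because some
# layers are typically kept at higher precision.
```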


# References

All search models in Jumper are based on the following research:

  1. Sigmoid Loss for Language Image Pre-Training (Zhai, Mustafa, Kolesnikov, Beyer, 2023) https://arxiv.org/abs/2303.15343

  2. SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features (Tschannen et al., 2025) https://arxiv.org/abs/2502.14786

  3. Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design (Alabdulmohsin, Zhai, Kolesnikov, Beyer, 2024) https://arxiv.org/abs/2305.13035

  4. Learning Transferable Visual Models From Natural Language Supervision (Radford et al., 2021) https://arxiv.org/abs/2103.00020

  5. Robust Speech Recognition via Large-Scale Weak Supervision (Radford, Kim, Xu, Brockman, McLeavey, Sutskever, 2022) https://arxiv.org/abs/2212.04356

  6. MEXMA: Token-level objectives improve sentence representations (Janeiro, Piwowarski, Gallinari, Barrault, 2024) https://arxiv.org/abs/2409.12737