List of Machine Learning Models

Visual models

This is a more comprehensive list of the models and their properties.

Model Name (Display)	Resolution	Size on Disk	Information
Ultra Accurate	—	Download required	Highest-accuracy visual model. A different architecture from the V1/V2 models below. State-of-the-art search quality on Apple silicon Macs and Windows. Requires more compute and produces larger analysis files than the Accurate presets. Not available on Intel Macs.
V2 Medium (Default)	256×256	~1.5GB	Default model, bundled with the app installer. Balances accuracy and performance, with improved semantic understanding from the V2 model improvements. Works well on most modern computers.
V2 Multilingual x-high-res	512×512	~2GB	Excellent search result quality in both non-English and English languages. Benefits from more detailed and “verbose” searches, e.g. “a metal sign saying XYZ” instead of just “XYZ”.
V1 Multilingual high-res	384×384	~4GB	Very accurate search result quality in both non-English and English languages. Benefits from more detailed and “verbose” searches, e.g. “a metal sign saying XYZ” instead of just “XYZ”.
V2 Medium high-res	384×384	~1.5GB	Same parameter count as V2 Medium, but analyzes frames at 384×384 for more image detail. Good if you need slightly finer detail than 256×256, but be aware that it requires more RAM/VRAM.
V2 Medium x-high-res	512×512	~1.5GB	Same parameter count as V2 Medium, but even higher resolution (512×512). Ideal for text detection/OCR or detailed reverse image searches. Uses significantly more memory and produces larger analysis files.
V2 Large	256×256	~3.3GB	Larger model with higher accuracy in challenging scenarios.
V2 Large high-res	384×384	~3.3GB	Same as V2 Large but with higher resolution (384×384). Ideal for text detection/OCR or detailed reverse image searches.
V2 Large x-high-res	512×512	~3.3GB	Same parameters as V2 Large, but at 512×512 input resolution. Ideal for text detection/OCR or detailed reverse image searches.
V2 XLarge	256×256	~4.5GB	Even larger V2 model. Offers improved recognition of subtle elements. Ideal if you want top-tier accuracy and have the computer to run it.
V2 XLarge high-res	384×384	~4.5GB	A higher-resolution variant of V2 XLarge.
V2 XLarge x-high-res	512×512	~4.5GB	The heaviest V2 model in both resolution and parameter count. Even more accurate, and even more resource demanding.
Medium	256×256	~812MB	Legacy V1 model and the previous default, formerly bundled with the application. Reasonably accurate but outperformed by V2 Medium in most scenarios. If your hardware can handle V2 Medium, prefer that instead.
Medium x-high-res	512×512	~812MB	Same as V1 Medium but at 512×512. Useful for text detection, reverse image searches, etc. Produces larger analysis files than some bigger V1 models, purely due to the higher frame resolution.
Large multilingual	256×256	~1.48GB	V1 Multilingual version that improves accuracy for non-English text. V2 equivalents are multilingual by default.
Large	256×256	~2.61GB	Larger V1 model. Prefer V2 alternatives.
Large high-res	384×384	~2.61GB	Same as V1 Large but higher resolution. Demands more resources.
XLarge high-res	384×384	~3.51GB	Largest V1 model. High accuracy, but overshadowed by the new V2 XLarge. Recommended only for legacy compatibility if you can’t run V2.
XLarge multilingual	256×256	~4.51GB	Largest V1 multilingual model. Very high accuracy for non-English searches - V2 models are multilingual by default, but V1 multilingual models can be slightly better than V2 models on certain languages.
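
The table notes that higher-resolution variants need more memory and produce larger analysis files. As a rough illustration of why, the sketch below compares only the number of pixels per analyzed frame at each listed resolution. It is plain arithmetic, not a measurement of Jumper's actual memory use or file sizes, which the table does not quantify. The label `standard` and the function name `pixel_ratio` are made up for this example.

```python
# Input resolutions from the table above (square frames, side length in pixels).
# The "standard" label is an assumption for illustration; the table only names
# the "high-res" and "x-high-res" variants.
RESOLUTIONS = {"standard": 256, "high-res": 384, "x-high-res": 512}


def pixel_ratio(side: int, baseline: int = 256) -> float:
    """Return how many times more pixels a side x side frame has than a baseline frame."""
    return (side * side) / (baseline * baseline)


for label, side in RESOLUTIONS.items():
    print(f"{label:>10}: {side}x{side} = {side * side} pixels, "
          f"{pixel_ratio(side):.2f}x the 256x256 pixel count")
```

A 384×384 frame has 2.25 times the pixels of a 256×256 frame, and a 512×512 frame has 4 times as many, which is consistent with the table's warnings about the x-high-res variants.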

Speech models

For transcriptions, Jumper uses Whisper models developed by OpenAI. The exact model depends on your platform.

Platform	Model Variant	Size on Disk	Notes
Windows / Intel Mac	whisper-large-v3-turbo	~1.62GB	Bundled with Windows and Intel Mac installers.
Apple M-series Macs	whisper-large-v3-turbo	~467MB	Uses a quantized version converted to Apple’s MLX framework for hardware acceleration (faster analysis, smaller model size). Bundled with Apple M-series installer.

Language-specific models

Beyond the default Whisper model, Jumper offers more accurate speech-to-text models for specific languages, including Chinese (20+ dialects), Arabic, Hindi, Cantonese, Korean, Thai, Vietnamese, Filipino, and Russian. Some languages also have state-of-the-art fine-tuned models, including Japanese, Swedish, and Moroccan Arabic (Darija). When a dedicated model is available for the language you’re transcribing, selecting that language during analysis gives more accurate results than auto-detection. For the full language list, see Supported Languages.
