# Visual Search Models

All visual search models in Jumper are based on this paper by researchers at Google DeepMind:
Sigmoid Loss for Language Image Pre-Training (Zhai, Mustafa, Kolesnikov, Beyer, 2023) [1].
Additionally, the XLarge and XLarge multilingual models incorporate techniques introduced by researchers at Google DeepMind in this paper:
Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design (Alabdulmohsin, Zhai, Kolesnikov, Beyer, 2024) [2].

Honorable mention to this seminal work by researchers at OpenAI, which lays the foundation that the above two papers build upon:
Learning Transferable Visual Models From Natural Language Supervision (Radford et al., 2021) [3].

Recommendation: For most use cases, we suggest using at least the Large variant. A MacBook Pro (M1 Pro, 16GB RAM) comfortably handles every model except XLarge and Medium x-high-res.
That limit comes down to my MBP "only" having 16GB of unified memory: in my tests, 16GB of unified memory maxes out at roughly the same point as an Nvidia GPU with 6GB of VRAM.

Changing models will require you to re-analyse your footage!
However, if you have previously analysed a file with a model of the same resolution as the new one, processing will start at 50% done, since the frame-extraction step is already complete.

Models are downloaded to the .jumper/models directory in your HOME directory.
For me (Max), that would be /Users/max/.jumper/models on macOS and C:/Users/Max/.jumper/models on Windows.
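
If you ever need to inspect or clear out the downloaded models, the directory can be resolved the same way on any platform. Below is a minimal sketch (not part of Jumper itself; it only assumes the ~/.jumper/models location described above) that lists whatever is currently in that directory:

```python
from pathlib import Path

# Resolve ~/.jumper/models on macOS, Windows, and Linux alike.
# Path.home() expands to /Users/<name> on macOS and C:/Users/<name> on Windows.
models_dir = Path.home() / ".jumper" / "models"

if not models_dir.exists():
    print(f"No models downloaded yet: {models_dir} does not exist.")
else:
    for entry in sorted(models_dir.iterdir()):
        if entry.is_file():
            print(f"{entry.name}: {entry.stat().st_size / 1e9:.2f} GB")
        else:
            print(f"{entry.name}/ (directory)")
```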

| Model | Resolution | Parameters | Size on Disk | Notes |
|---|---|---|---|---|
| Medium | 256×256 | ~203M | ~812MB | Base model bundled with the application. |
| Medium x-high-res | 512×512 | ~203M | ~812MB | Same parameters/size as Medium, but higher resolution than any other model. Improves certain tasks (e.g. reverse image searches, text detection/OCR) due to increased frame detail. |
| Large multilingual | 256×256 | ~370M | ~1.48GB | Multilingual version that dramatically increases accuracy for non-English searches. |
| Large | 256×256 | ~652M | ~2.61GB | Larger version of the Medium model. |
| Large high-res | 384×384 | ~652M | ~2.61GB | Same as Large, but using higher-resolution frames. |
| XLarge | 384×384 | ~877M | ~3.51GB | Even larger model; uses techniques from paper [2]. |
| XLarge multilingual | 256×256 | ~1.13B | ~4.51GB | Multilingual version using paper [2], providing the highest accuracy for non-English searches. |
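
As a side note, the "Size on Disk" figures track the parameter counts closely: each is roughly parameters × 4 bytes, which suggests the weights are stored as 32-bit floats. A quick sanity check on the table's numbers (the 4-bytes-per-parameter figure is an inference, not something stated here):

```python
# Rough check: size on disk ≈ parameters × 4 bytes (32-bit floats).
# Assumption: the exact serialization format is not specified in this document.
models = {
    "Medium":              203e6,
    "Large multilingual":  370e6,
    "Large":               652e6,
    "XLarge":              877e6,
    "XLarge multilingual": 1.13e9,
}

for name, params in models.items():
    print(f"{name}: ~{params * 4 / 1e9:.2f} GB")
# Medium -> ~0.81 GB, Large multilingual -> ~1.48 GB, Large -> ~2.61 GB,
# XLarge -> ~3.51 GB, XLarge multilingual -> ~4.52 GB -- matching the table.
```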

One final note: high-resolution models are not slower to use than an otherwise identical lower-resolution variant.
They do, however, demand more RAM/VRAM during the analysis phase and produce larger analysis files than their lower-resolution counterparts. Analysis file size is driven by frame resolution rather than model size, so Medium x-high-res (512×512) produces larger analysis files than the much larger XLarge multilingual model (256×256).
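
To put rough numbers on the resolution point: a 512×512 frame holds four times the pixels of a 256×256 frame, and a 384×384 frame holds 2.25 times as many. The sketch below just computes those ratios; the assumption that analysis size scales roughly with pixels per frame follows from the note above, not from any documented file format:

```python
# Relative pixel counts per extracted frame for the resolutions in the table.
# Assumption: analysis file size grows roughly with pixels per frame,
# as implied above; the actual on-disk format is a Jumper internal.
base = 256 * 256

for r in (256, 384, 512):
    print(f"{r}x{r}: {r * r / base:.2f}x the pixels of 256x256")
# 256x256: 1.00x, 384x384: 2.25x, 512x512: 4.00x
```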


# Transcription Models

For transcriptions, we use Whisper models developed from research at OpenAI:
Robust Speech Recognition via Large-Scale Weak Supervision (Radford, Kim, Xu, Brockman, McLeavey, Sutskever, 2022) [4].

Specifically, the whisper-large-v3-turbo variant is used; the exact format of the model depends on the platform:

| Platform | Model Variant | Parameters | Size on Disk | Notes |
|---|---|---|---|---|
| Windows / Intel Mac | whisper-large-v3-turbo | ~809M | ~1.62GB | Bundled with the Windows and Intel Mac installers. |
| Apple M-series Macs | whisper-large-v3-turbo | ~809M | ~467MB | Uses a quantized version converted to Apple's MLX framework for hardware acceleration (faster analysis, smaller model size). Bundled with the Apple M-series installer. |

Note: The quantized M-series model keeps the same number of parameters but stores each one with fewer bits (lower numerical precision), which reduces the size on disk and takes advantage of Apple hardware optimizations.
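
For a rough sense of what the quantization buys, the table's own numbers are enough: ~809M parameters at 16 bits each comes to about 1.62GB (matching the Windows/Intel figure, which suggests 16-bit weights there), while the ~467MB MLX file works out to roughly 4-5 bits per parameter on average. The exact MLX quantization scheme isn't specified here; the sketch below is only arithmetic on the listed sizes:

```python
# Arithmetic check on the Whisper figures in the table above.
# Assumption: only the sizes listed here are known; the precise MLX
# quantization scheme is not documented in this section.
params = 809e6

fp16_bytes = params * 2                  # 16-bit weights -> ~1.62 GB
print(f"16-bit: ~{fp16_bytes / 1e9:.2f} GB")

mlx_bytes = 467e6                        # size on disk from the table
bits_per_param = mlx_bytes * 8 / params
print(f"MLX quantized: ~{bits_per_param:.1f} bits per parameter on average")
# ~4.6 bits/param, versus 16 bits/param for the unquantized variant.
```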


# References

[1]

@misc{zhai2023sigmoidlosslanguageimage,
      title={Sigmoid Loss for Language Image Pre-Training}, 
      author={Xiaohua Zhai and Basil Mustafa and Alexander Kolesnikov and Lucas Beyer},
      year={2023},
      eprint={2303.15343},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2303.15343}, 
}

[2]

@misc{alabdulmohsin2024gettingvitshapescaling,
      title={Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design}, 
      author={Ibrahim Alabdulmohsin and Xiaohua Zhai and Alexander Kolesnikov and Lucas Beyer},
      year={2024},
      eprint={2305.13035},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2305.13035}, 
}

[3]

@misc{radford2021learningtransferablevisualmodels,
      title={Learning Transferable Visual Models From Natural Language Supervision}, 
      author={Alec Radford and Jong Wook Kim and Chris Hallacy and Aditya Ramesh and Gabriel Goh and Sandhini Agarwal and Girish Sastry and Amanda Askell and Pamela Mishkin and Jack Clark and Gretchen Krueger and Ilya Sutskever},
      year={2021},
      eprint={2103.00020},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2103.00020}, 
}

[4]

@misc{radford2022robustspeechrecognitionlargescale,
      title={Robust Speech Recognition via Large-Scale Weak Supervision}, 
      author={Alec Radford and Jong Wook Kim and Tao Xu and Greg Brockman and Christine McLeavey and Ilya Sutskever},
      year={2022},
      eprint={2212.04356},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2212.04356}, 
}