# Machine Learning

> Understanding how Jumper uses AI models to search your footage visually, transcribe speech, and detect faces

## Overview

Jumper uses three types of machine learning models to help you search through your footage:

* **Visual search models**: AI systems that can "see" and understand images and video frames, then connect that understanding to text queries or other images.
* **Speech models**: AI systems that can transcribe spoken words into searchable text.
* **Face detection models**: AI systems that can identify and group similar faces together, allowing you to search for specific people.

All models run entirely on your local machine, so your footage never leaves your computer.

## Visual Search Models

Visual search models are AI systems that can "see" and understand images and video frames, then connect that understanding to text queries or other images. This is what makes Jumper's visual search work.

**How visual search models work:**

These models learn to find correlations between text and visual data from their training material. When you search for something like "a person walking through a door," the model has learned to understand:

* What "a person" looks like visually
* What "walking" means in terms of motion and pose
* What "a door" is and how it appears in different contexts
* How these elements relate to each other in a scene

When you perform a text search, the model identifies the best matching visual elements in your footage. Thanks to efficient algorithms, this makes it possible to find visual content across thousands of hours of video.

The model doesn't just match keywords. It understands the semantic meaning of your search query and finds frames that match that meaning, even if the exact words don't appear in the footage.
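
This page does not describe how Jumper's models are implemented internally. One common way to build this kind of semantic search is to map both the query and each video frame to an embedding vector, then rank frames by cosine similarity to the query. The sketch below illustrates that general idea in plain Python. The function names, timestamps, and three-dimensional vectors are hypothetical stand-ins; a real model would produce vectors with hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction; treat it as "no match"
    return dot / (norm_a * norm_b)


def rank_frames(query_embedding, frame_embeddings, top_k=3):
    """Return up to top_k (timestamp, score) pairs, best match first."""
    scored = [
        (timestamp, cosine_similarity(query_embedding, embedding))
        for timestamp, embedding in frame_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical, hand-made embeddings keyed by timestamp (assumption:
# in a real system these would be computed once, during analysis).
frame_embeddings = {
    "00:00:05": [0.9, 0.1, 0.0],
    "00:01:12": [0.1, 0.8, 0.3],
    "00:02:40": [0.7, 0.2, 0.1],
}

# Hypothetical embedding of a text query such as "a person walking through a door".
query_embedding = [1.0, 0.0, 0.0]

for timestamp, score in rank_frames(query_embedding, frame_embeddings, top_k=2):
    print(f"{timestamp}  similarity={score:.3f}")
```

Under this assumption, searching with an image instead of text would follow the same pattern: only the source of the query vector changes, while the ranking step stays the same.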

**Search methods:**

* **Text search**: Enter a natural language query to find matching visual content
* **Image search**: Use an image or frame as your search input instead of text. The model uses similar algorithms to find visually similar content across your footage

**What this means for you:**

* You can search using natural, conversational language
* You can search using images or frames from your footage
* The model understands context and relationships between objects
* It works across different languages (multilingual models)
* Higher resolution models can detect finer details like text on signs or small objects

**Model characteristics:**

* **Size**: Larger models generally offer better accuracy but require more memory and processing power
* **Resolution**: Higher resolution models (384×384, 512×512) analyze frames in more detail, useful for detecting text, small objects, or fine visual details
* **Speed**: Smaller models analyze faster, while larger models take more time but provide better results

For a complete list of available models and their specifications, see the [List of Machine Learning Models](/reference/machine-learning-models).

## Speech Models

For speech search, Jumper uses **Whisper** models developed by OpenAI. These models transcribe spoken words in your footage, making dialogue searchable.

**How speech models work:**

The Whisper model listens to the audio track of your media files and converts speech into text. This transcription happens during the analysis phase, so once your media is analyzed, you can search for any word or phrase that was spoken.

**Key features:**

* Supports 111 languages automatically
* Handles different accents, background noise, and audio quality
* Transcribes dialogue with timestamps
* Works entirely offline with no cloud processing required

**Platform differences:**

The exact Whisper model variant depends on your platform.
Apple M-series Macs use an optimized version that takes advantage of Apple's hardware acceleration, resulting in faster analysis and a smaller model size.

## Face Detection Models

Face detection requires a Jumper Pro license.

Face detection uses specialized AI models to identify and recognize people in your footage. These models analyze faces frame by frame, extract facial features, and group similar faces together.

**How face detection models work:**

The face detection algorithm processes each frame of your media to:

* Detect faces in the frame
* Group similar faces together, even when lighting, angles, or expressions change

**What this means for you:**

* Automatically identify all appearances of a person across your entire footage library
* Search for specific people using the `@` syntax (e.g., `@John sitting on a bench`)
* Organize people into Collections to keep different productions separate
* Find every scene where someone appears, even if they're in the background

For more information about using face detection, see [Face detection](/core-concepts/face-detection).

## Choosing the Right Model

Depending on your workflow, use case, and hardware setup, you might want to choose a model other than the default. Some suggestions are listed below, or you can explore all the available models.

* Quick analysis with good accuracy, 256×256 resolution. Great for most workflows.
* Larger model with higher accuracy, 256×256 resolution.
* Top-tier accuracy with 384×384 resolution. Benefits from more detailed and verbose searches, great multilingual support.
* Highest accuracy with 512×512 resolution among the V1/V2 presets. Benefits from more detailed and verbose searches, great multilingual support.
* State-of-the-art visual search quality using a different model architecture than the presets above. Available on Apple silicon Macs and Windows. Requires a separate download, more compute, and produces larger analysis files.
Choose this when search quality matters more than analysis speed.

For a complete list of all available models with detailed specifications, see the [List of Machine Learning Models](/reference/machine-learning-models).