Your images,
intelligently sorted
ImageSieve uses CLIP to understand the semantic content of your photos and automatically sort them into categories you define with natural language. On-device inference via CoreML means your images never leave your machine.
Everything you need to
understand your images
A complete AI vision toolkit for categorization, search, analysis, and visualization.
AI Categorization
Define categories with natural language descriptions like "a photo of a cat" or "scenic mountain landscape." ImageSieve uses CLIP ViT-L-14 to match images semantically -- no manual tagging, no keyword matching. 33 pre-configured categories included out of the box.
Semantic Search
Search your library with text, images, or both. Type "sunset over the ocean" and find matching photos ranked by CLIP similarity. Drop a reference image to find visually similar ones. Adaptive filtering derives the result cutoff from the mean and standard deviation of similarity scores, so thresholds adjust to each query.
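A minimal sketch of that adaptive cutoff, assuming the filter keeps results scoring above the mean plus one standard deviation; the function and tuple shape are illustrative, not ImageSieve's actual API:

import Foundation

// Keep only results whose CLIP similarity clears an adaptive cutoff:
// the mean score plus one standard deviation of the current result set.
func adaptiveFilter(_ results: [(url: URL, similarity: Double)]) -> [(url: URL, similarity: Double)] {
    guard results.count > 1 else { return results }
    let scores = results.map { $0.similarity }
    let mean = scores.reduce(0, +) / Double(scores.count)
    let variance = scores.map { ($0 - mean) * ($0 - mean) }.reduce(0, +) / Double(scores.count)
    let cutoff = mean + variance.squareRoot()
    return results
        .filter { $0.similarity >= cutoff }
        .sorted { $0.similarity > $1.similarity }
}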
Vision Tagging
Apple Vision framework automatically detects scenes, animals, faces, text, and objects in your images with 1000+ classification labels. Tags are searchable and visible in the image inspector alongside CLIP categories and EXIF data.
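As a rough sketch of the Vision classification step (the 0.5 confidence cutoff and function name are illustrative; the app's real pipeline may filter differently):

import Foundation
import Vision

// Classify an image with Apple's Vision framework and keep confident
// labels as searchable tags.
func visionTags(for imageURL: URL) throws -> [String] {
    let request = VNClassifyImageRequest()
    let handler = VNImageRequestHandler(url: imageURL, options: [:])
    try handler.perform([request])
    let observations = request.results as? [VNClassificationObservation] ?? []
    return observations
        .filter { $0.confidence > 0.5 }   // keep reasonably confident labels
        .map { $0.identifier }            // e.g. "cat", "beach", "document"
}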
EXIF Extraction
Camera make, model, lens, focal length, aperture, shutter speed, ISO, GPS coordinates, and more. All metadata is extracted from JPEG, HEIC, and RAW files. Browse by camera or lens, and explore shooting patterns in the EXIF Explorer visualization.
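A sketch of the kind of ImageIO call involved; the keys shown are a small subset, and error handling and RAW-specific dictionaries are omitted:

import Foundation
import ImageIO

// Pull camera, exposure, and GPS metadata out of an image file with ImageIO.
// Works for JPEG, HEIC, and the RAW formats the OS can decode.
func exifSummary(for url: URL) -> [String: Any] {
    guard let source = CGImageSourceCreateWithURL(url as CFURL, nil),
          let properties = CGImageSourceCopyPropertiesAtIndex(source, 0, nil) as? [CFString: Any]
    else { return [:] }

    let exif = properties[kCGImagePropertyExifDictionary] as? [CFString: Any] ?? [:]
    let tiff = properties[kCGImagePropertyTIFFDictionary] as? [CFString: Any] ?? [:]
    let gps  = properties[kCGImagePropertyGPSDictionary]  as? [CFString: Any] ?? [:]

    return [
        "make":        tiff[kCGImagePropertyTIFFMake] ?? "unknown",
        "model":       tiff[kCGImagePropertyTIFFModel] ?? "unknown",
        "focalLength": exif[kCGImagePropertyExifFocalLength] ?? "unknown",
        "aperture":    exif[kCGImagePropertyExifFNumber] ?? "unknown",
        "iso":         exif[kCGImagePropertyExifISOSpeedRatings] ?? "unknown",
        "latitude":    gps[kCGImagePropertyGPSLatitude] ?? "unknown",
    ]
}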
Duplicate Detection
Find duplicate and visually similar images using CLIP vector comparison. Three threshold presets -- exact, similar, and loose -- let you dial in how aggressively to flag matches. Review side-by-side and keep only the shots you want.
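The comparison itself can be sketched like this, using the angular-distance formula described in "How CLIP matching works" below; the preset values and names are illustrative, not ImageSieve's exact thresholds:

import Foundation

// Flag near-duplicate pairs by comparing CLIP embeddings.
enum DuplicatePreset: Double {
    case exact   = 0.45   // nearly identical embeddings
    case similar = 0.70   // same subject, different shot
    case loose   = 0.90   // broadly related images
}

func angularDistance(_ a: [Double], _ b: [Double]) -> Double {
    let dot = zip(a, b).map(*).reduce(0, +)
    let normA = a.map { $0 * $0 }.reduce(0, +).squareRoot()
    let normB = b.map { $0 * $0 }.reduce(0, +).squareRoot()
    return (2 * (1 - dot / (normA * normB))).squareRoot()
}

// Returns index pairs whose angular distance falls under the chosen preset.
func duplicatePairs(embeddings: [[Double]], preset: DuplicatePreset) -> [(Int, Int)] {
    var pairs: [(Int, Int)] = []
    for i in 0..<embeddings.count {
        for j in (i + 1)..<embeddings.count where angularDistance(embeddings[i], embeddings[j]) <= preset.rawValue {
            pairs.append((i, j))
        }
    }
    return pairs
}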
3D Visualizations
Four interactive views to explore your library in 3D. Cluster Space projects CLIP embeddings via t-SNE. Tag Nebula maps Vision labels. Photo Map plots GPS coordinates. EXIF Explorer visualizes camera settings. All rendered at 60fps with Metal.
On-Device Inference
CoreML runs CLIP ViT-L-14 entirely on your device using the Apple Neural Engine and Metal GPU. No cloud, no server, no API keys. Your photos never leave your machine. The CLI tool can also connect to a local or remote CLIP server for batch processing.
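Loading and running a CoreML-converted CLIP image encoder might look like the sketch below; the model path and the "image"/"embedding" feature names are assumptions for illustration, not the app's actual model interface:

import CoreML
import CoreVideo

// Run a CoreML-converted CLIP image encoder on the Neural Engine / GPU.
func imageEmbedding(for pixelBuffer: CVPixelBuffer) throws -> [Float] {
    let config = MLModelConfiguration()
    config.computeUnits = .all   // let CoreML schedule across ANE, GPU, and CPU

    let modelURL = URL(fileURLWithPath: "CLIPImageEncoder.mlmodelc")  // hypothetical compiled model
    let model = try MLModel(contentsOf: modelURL, configuration: config)

    let input = try MLDictionaryFeatureProvider(
        dictionary: ["image": MLFeatureValue(pixelBuffer: pixelBuffer)]  // assumed input name
    )
    let output = try model.prediction(from: input)

    guard let embedding = output.featureValue(for: "embedding")?.multiArrayValue else {  // assumed output name
        return []
    }
    return (0..<embedding.count).map { Float(truncating: embedding[$0]) }
}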
Spotlight & Siri
Categorized images are indexed in Spotlight search -- find photos by category name from anywhere on your Mac. Siri Shortcuts for "Get Library Statistics" and "List Categories" give you voice access to your library metadata.
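Indexing with Core Spotlight is roughly this shape; the domain identifier and attribute choices here are hypothetical:

import CoreSpotlight
import UniformTypeIdentifiers

// Index a categorized image in Spotlight so it can be found by category
// name from anywhere on the Mac.
func indexImage(at url: URL, category: String) {
    let attributes = CSSearchableItemAttributeSet(contentType: .image)
    attributes.title = url.lastPathComponent
    attributes.keywords = [category]
    attributes.contentURL = url

    let item = CSSearchableItem(
        uniqueIdentifier: url.absoluteString,
        domainIdentifier: "imagesieve.categorized",   // hypothetical domain
        attributeSet: attributes
    )
    CSSearchableIndex.default().indexSearchableItems([item]) { error in
        if let error { print("Spotlight indexing failed: \(error)") }
    }
}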
CLI Tool
Batch-process thousands of images from the command line with the Go-based CLI. Connects to a CLIP server (Python or Swift) over HTTP for maximum throughput. Supports YAML configuration for categories, thresholds, and output formats.
How CLIP matching works
From natural language to semantic image understanding
Define categories in natural language
Create categories using plain English descriptions. Be as specific or broad as you like -- CLIP understands semantic meaning, not just keywords. Each category has a tuned distance threshold that controls how strict the matching is.
categories:
- name: landscapes
query: "a landscape photo of nature, mountains, or scenic view"
maxDistance: 1.30
- name: portraits
query: "a portrait photo of a person, headshot, or face"
maxDistance: 1.28
- name: cats
query: "a photo of a cat or kitten"
maxDistance: 1.25
- name: food
query: "a photo of food, a meal, or cooking"
maxDistance: 1.28
- name: architecture
query: "a photo of a building, architecture, or interior design"
maxDistance: 1.30
CLIP encodes images & text into vectors
Both your images and category descriptions are converted into 768-dimensional vectors by the CLIP ViT-L-14 model. Images and text share the same embedding space, so semantically similar content ends up near each other. In the native app, this runs entirely on-device via CoreML and the Apple Neural Engine.
angular_distance = sqrt(2 * (1 - cosine_similarity))
Angular distance determines the match
The angular distance between each image vector and every category vector is calculated. The closest category below its threshold wins; images with no close match go to "uncategorized." Lower thresholds (1.20-1.25) mean stricter matching for very specific categories, while higher thresholds (1.32-1.38) allow broader, more abstract categories.
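Putting these steps together, category assignment reduces to a nearest-neighbor check against per-category thresholds. A minimal sketch, with illustrative types and names:

import Foundation

// Assign an image embedding to the closest category whose angular distance
// falls under that category's threshold, else "uncategorized".
struct Category {
    let name: String
    let embedding: [Double]   // CLIP text embedding of the category query
    let maxDistance: Double
}

func angularDistance(_ a: [Double], _ b: [Double]) -> Double {
    let dot = zip(a, b).map(*).reduce(0, +)
    let normA = a.map { $0 * $0 }.reduce(0, +).squareRoot()
    let normB = b.map { $0 * $0 }.reduce(0, +).squareRoot()
    return (2 * (1 - dot / (normA * normB))).squareRoot()
}

func assignCategory(imageEmbedding: [Double], categories: [Category]) -> String {
    let scored = categories.map { ($0, angularDistance(imageEmbedding, $0.embedding)) }
    guard let (best, distance) = scored.min(by: { $0.1 < $1.1 }),
          distance <= best.maxDistance else {
        return "uncategorized"
    }
    return best.name
}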
Explore the vector space
CLIP encodes every image into a 768-dimensional vector. ImageSieve projects these into 3D using t-SNE, revealing how your photos cluster by semantic similarity. Drag to rotate. Scroll to zoom.
Each point represents a photo. Colors indicate the assigned category. Nearby points share semantic meaning -- even if they look nothing alike to the human eye. The native app renders this in real time using Metal shaders and Apple's Accelerate framework for vectorized t-SNE.
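As one concrete piece of that pipeline, the pairwise distance matrix t-SNE starts from can be computed with Accelerate's vectorized vDSP routines. The sketch below omits the perplexity search and gradient descent that produce the final 3D layout, and the function name is illustrative:

import Accelerate

// Pairwise squared-distance matrix over CLIP embeddings, the input to t-SNE.
func pairwiseSquaredDistances(_ embeddings: [[Float]]) -> [[Float]] {
    let n = embeddings.count
    var distances = [[Float]](repeating: [Float](repeating: 0, count: n), count: n)
    for i in 0..<n {
        for j in (i + 1)..<n {
            let d = vDSP.distanceSquared(embeddings[i], embeddings[j])
            distances[i][j] = d
            distances[j][i] = d
        }
    }
    return distances
}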
See it in action
33 pre-configured categories with tuned distance thresholds, plus semantic search

Ready to organize your photo library?
Stop manually sorting thousands of images. Let CLIP AI understand your photos and categorize them with natural language -- on-device, private, and fast.