Powered by OpenAI CLIP ViT-L-14

Your images,
intelligently sorted

ImageSieve uses CLIP to understand the semantic content of your photos and automatically sort them into categories you define with natural language. On-device inference via CoreML means your images never leave your machine.

ImageSieve — Categorizing 847 images...
File            | Best Category  | Distance | Status
IMG_4821.heic   | landscapes     | 0.82     | COPY
DSC_0034.jpg    | portraits      | 0.91     | COPY
photo_2024.png  | cats           | 0.76     | COPY
DCIM_1087.jpg   | architecture   | 0.95     | COPY
Screenshot.png  | screenshots    | 1.02     | COPY
random_img.jpg  | uncategorized  | 1.41     | SKIP

5 categorized | 1 uncategorized | 33 categories active
Native macOS & iOS App
+
Go CLI for Batch Processing
+
On-Device CoreML Inference

Everything you need to
understand your images

A complete AI vision toolkit for categorization, search, analysis, and visualization.

AI Categorization

Define categories with natural language descriptions like "a photo of a cat" or "scenic mountain landscape." ImageSieve uses CLIP ViT-L-14 to match images semantically -- no manual tagging, no keyword matching. 33 pre-configured categories included out of the box.

Semantic Search

Search your library with text, images, or both. Type "sunset over the ocean" and find matching photos ranked by CLIP similarity. Drop a reference image to find visually similar ones. Adaptive filtering derives the result cutoff from the mean and standard deviation of each query's similarity scores, so only clearly relevant matches are shown.
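
A minimal sketch of how such an adaptive cutoff can work (Swift; the type and function names are illustrative, and the exact rule -- keep results at or above the mean plus one standard deviation -- is an assumption, not ImageSieve's documented behavior):

import Foundation

struct SearchResult {
    let path: String
    let similarity: Double   // cosine similarity between query and image embedding
}

// Keep only results that score clearly above the average for this query.
// The cutoff adapts per query: mean + k * standard deviation of the scores.
func adaptiveFilter(_ results: [SearchResult], k: Double = 1.0) -> [SearchResult] {
    guard results.count > 1 else { return results }
    let scores = results.map(\.similarity)
    let mean = scores.reduce(0, +) / Double(scores.count)
    let variance = scores.reduce(0) { $0 + ($1 - mean) * ($1 - mean) } / Double(scores.count)
    let cutoff = mean + k * variance.squareRoot()
    return results
        .filter { $0.similarity >= cutoff }
        .sorted { $0.similarity > $1.similarity }
}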

Vision Tagging

The Apple Vision framework automatically detects scenes, animals, faces, text, and objects in your images, drawing on more than 1,000 built-in classification labels. Tags are searchable and visible in the image inspector alongside CLIP categories and EXIF data.
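
The classification part of that tagging can be sketched with Vision's built-in image classifier (faces and text use separate Vision requests; the confidence cutoff here is an assumption):

import Foundation
import Vision

// Scene/object tags from Vision's built-in classifier (1000+ labels).
// Faces, text, and animals use other requests, e.g. VNDetectFaceRectanglesRequest.
func visionTags(for url: URL, minimumConfidence: Float = 0.3) throws -> [String] {
    let request = VNClassifyImageRequest()
    let handler = VNImageRequestHandler(url: url, options: [:])
    try handler.perform([request])
    return (request.results ?? [])
        .filter { $0.confidence >= minimumConfidence }
        .map { $0.identifier }        // e.g. "outdoor", "dog", "document"
}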

EXIF Extraction

Camera make, model, lens, focal length, aperture, shutter speed, ISO, GPS coordinates, and more. All metadata is extracted from JPEG, HEIC, and RAW files. Browse by camera or lens, and explore shooting patterns in the EXIF Explorer visualization.
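
A sketch of that extraction using Apple's ImageIO framework (the property keys are standard ImageIO constants; the output formatting is illustrative):

import Foundation
import ImageIO

// Read camera metadata without decoding the full image.
// The same call path covers JPEG, HEIC, and most RAW formats.
func printExif(for url: URL) {
    guard let source = CGImageSourceCreateWithURL(url as CFURL, nil),
          let props = CGImageSourceCopyPropertiesAtIndex(source, 0, nil) as? [CFString: Any]
    else { return }

    let tiff = props[kCGImagePropertyTIFFDictionary] as? [CFString: Any] ?? [:]
    let exif = props[kCGImagePropertyExifDictionary] as? [CFString: Any] ?? [:]

    let make    = tiff[kCGImagePropertyTIFFMake] as? String ?? "?"
    let model   = tiff[kCGImagePropertyTIFFModel] as? String ?? "?"
    let lens    = exif[kCGImagePropertyExifLensModel] as? String ?? "?"
    let focal   = exif[kCGImagePropertyExifFocalLength] as? Double ?? 0
    let fNumber = exif[kCGImagePropertyExifFNumber] as? Double ?? 0
    let shutter = exif[kCGImagePropertyExifExposureTime] as? Double ?? 0
    let iso     = (exif[kCGImagePropertyExifISOSpeedRatings] as? [Int])?.first ?? 0
    print("\(make) \(model), \(lens): \(focal)mm f/\(fNumber) \(shutter)s ISO \(iso)")

    if let gps = props[kCGImagePropertyGPSDictionary] as? [CFString: Any],
       let lat = gps[kCGImagePropertyGPSLatitude] as? Double,
       let lon = gps[kCGImagePropertyGPSLongitude] as? Double {
        print("GPS: \(lat), \(lon)")
    }
}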

Duplicate Detection

Find duplicate and visually similar images using CLIP vector comparison. Three threshold presets -- exact, similar, and loose -- let you dial in how aggressively to flag matches. Review side-by-side and keep only the shots you want.
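
In sketch form, duplicate detection reduces to pairwise distances between image embeddings. The preset values below are illustrative placeholders, not ImageSieve's actual "exact / similar / loose" cutoffs:

import Foundation

// Illustrative presets -- the real threshold values may differ.
enum DuplicatePreset: Float {
    case exact = 0.10, similar = 0.35, loose = 0.60
}

// Flag every image pair whose embeddings fall within the preset angular distance.
// Embeddings are assumed to be L2-normalized CLIP vectors keyed by file path.
func findDuplicates(in embeddings: [String: [Float]],
                    preset: DuplicatePreset) -> [(String, String, Float)] {
    let items = Array(embeddings)
    var pairs: [(String, String, Float)] = []
    for i in 0..<items.count {
        for j in (i + 1)..<items.count {
            let cosine = zip(items[i].value, items[j].value).map(*).reduce(0, +)
            let distance = (2 * (1 - cosine)).squareRoot()   // angular distance, as in step 03
            if distance <= preset.rawValue {
                pairs.append((items[i].key, items[j].key, distance))
            }
        }
    }
    return pairs.sorted { $0.2 < $1.2 }
}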

3D Visualizations

Four interactive, Metal-rendered views for exploring your library in 3D. Cluster Space projects CLIP embeddings via t-SNE. Tag Nebula maps Vision labels. Photo Map plots GPS coordinates. EXIF Explorer visualizes camera settings. All four render at 60 fps.

On-Device Inference

CoreML runs CLIP ViT-L-14 entirely on your device using the Apple Neural Engine and Metal GPU. No cloud, no server, no API keys. Your photos never leave your machine. The CLI tool can also connect to a local or remote CLIP server for batch processing.
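
Loading a converted CLIP encoder for on-device execution can be sketched with Core ML like this; the model file name and the "image"/"embedding" feature names are placeholders, not ImageSieve's actual ones:

import Foundation
import CoreML
import CoreVideo

// Load a converted CLIP image encoder and allow the Neural Engine and GPU.
func loadImageEncoder() throws -> MLModel {
    let config = MLModelConfiguration()
    config.computeUnits = .all        // CPU + GPU + Apple Neural Engine
    let url = Bundle.main.url(forResource: "CLIPImageEncoder", withExtension: "mlmodelc")!
    return try MLModel(contentsOf: url, configuration: config)
}

// Run one image through the encoder and return its 768-dim embedding.
func embedding(for pixelBuffer: CVPixelBuffer, using model: MLModel) throws -> MLMultiArray? {
    let input = try MLDictionaryFeatureProvider(
        dictionary: ["image": MLFeatureValue(pixelBuffer: pixelBuffer)])
    let output = try model.prediction(from: input)
    return output.featureValue(for: "embedding")?.multiArrayValue
}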

Spotlight & Siri

Categorized images are indexed in Spotlight search -- find photos by category name from anywhere on your Mac. Siri Shortcuts for "Get Library Statistics" and "List Categories" give you voice access to your library metadata.
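
Indexing a categorized image in Spotlight can be sketched with Core Spotlight (the domain identifier and keyword choices here are illustrative):

import Foundation
import CoreSpotlight
import UniformTypeIdentifiers

// Make a categorized image findable in Spotlight by its category name.
func indexImage(path: String, category: String) {
    let attributes = CSSearchableItemAttributeSet(contentType: .image)
    attributes.title = (path as NSString).lastPathComponent
    attributes.keywords = [category, "ImageSieve"]
    attributes.contentDescription = "Categorized as \(category)"

    let item = CSSearchableItem(uniqueIdentifier: path,
                                domainIdentifier: "imagesieve.categorized",
                                attributeSet: attributes)
    CSSearchableIndex.default().indexSearchableItems([item]) { error in
        if let error = error { print("Spotlight indexing failed: \(error)") }
    }
}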

CLI Tool

Batch-process thousands of images from the command line with the Go-based CLI. Connects to a CLIP server (Python or Swift) over HTTP for maximum throughput. Supports YAML configuration for categories, thresholds, and output formats.

How CLIP matching works

From natural language to semantic image understanding

01

Define categories in natural language

Create categories using plain English descriptions. Be as specific or broad as you like -- CLIP understands semantic meaning, not just keywords. Each category has a tuned distance threshold that controls how strict the matching is.

categories.yaml
categories:
  - name: landscapes
    query: "a landscape photo of nature, mountains, or scenic view"
    maxDistance: 1.30

  - name: portraits
    query: "a portrait photo of a person, headshot, or face"
    maxDistance: 1.28

  - name: cats
    query: "a photo of a cat or kitten"
    maxDistance: 1.25

  - name: food
    query: "a photo of food, a meal, or cooking"
    maxDistance: 1.28

  - name: architecture
    query: "a photo of a building, architecture, or interior design"
    maxDistance: 1.30

02

CLIP encodes images & text into vectors

Both your images and category descriptions are converted into 768-dimensional vectors by the CLIP ViT-L-14 model. Images and text share the same embedding space, so semantically similar content ends up near each other. On the native app, this runs entirely on-device via CoreML and the Apple Neural Engine.

beach.jpg      →  [ 0.023, -0.156,  0.891,  0.342, -0.018, ... ]   (768-dim)
"landscapes"   →  [ 0.041, -0.142,  0.877,  0.315, -0.033, ... ]   (768-dim)
"portraits"    →  [-0.312,  0.654,  0.119, -0.445,  0.287, ... ]   (768-dim)
"food"         →  [ 0.187,  0.033, -0.562,  0.711,  0.094, ... ]   (768-dim)

angular_distance = sqrt(2 * (1 - cosine_similarity))

03

Angular distance determines the match

ImageSieve computes the angular distance between the image vector and every category vector; the closest match below its category's threshold wins, and images with no close match go to "uncategorized." Lower thresholds (1.20-1.25) mean stricter matching for very specific categories, while higher thresholds (1.32-1.38) allow broader, more abstract categories. A sketch of this selection logic follows the table below.

Category     | Distance | Threshold | Result
landscapes   | 0.82     | 1.30      | MATCH
food         | 1.21     | 1.28      | skip
portraits    | 1.31     | 1.28      | skip
cats         | 1.45     | 1.25      | skip
architecture | 1.48     | 1.30      | skip
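
In code, that selection step can be sketched as follows (Swift; the Category type and function names are illustrative, and embeddings are assumed to be L2-normalized so the dot product equals cosine similarity):

import Foundation

struct Category {
    let name: String
    let maxDistance: Float     // per-category threshold from categories.yaml
    let embedding: [Float]     // CLIP text embedding of the category query
}

// angular_distance = sqrt(2 * (1 - cosine_similarity)) for unit-length vectors.
func angularDistance(_ a: [Float], _ b: [Float]) -> Float {
    let cosine = zip(a, b).map(*).reduce(0, +)
    return (2 * (1 - cosine)).squareRoot()
}

// The closest category under its own threshold wins; nil means "uncategorized".
func bestCategory(for imageEmbedding: [Float],
                  in categories: [Category]) -> (name: String, distance: Float)? {
    let matches = categories
        .map { (name: $0.name, distance: angularDistance(imageEmbedding, $0.embedding), limit: $0.maxDistance) }
        .filter { $0.distance <= $0.limit }
        .sorted { $0.distance < $1.distance }
    guard let best = matches.first else { return nil }
    return (best.name, best.distance)
}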

Explore the vector space

CLIP encodes every image into a 768-dimensional vector. ImageSieve projects these into 3D using t-SNE, revealing how your photos cluster by semantic similarity. Drag to rotate. Scroll to zoom.

Demo legend (photos per category): Landscapes 32 | Portraits 28 | Cats 22 | Architecture 25 | Food 20 | Animals 26 | Travel 22 | Sports 18

Each point represents a photo. Colors indicate the assigned category. Nearby points share semantic meaning -- even if they look nothing alike to the human eye. The native app renders this in real time with Metal shaders, using Apple's Accelerate framework for the vectorized t-SNE math.
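
The full t-SNE projection is too long to sketch here, but the vectorized pairwise-similarity step it starts from might look like this with Accelerate (illustrative, not ImageSieve's actual code):

import Accelerate

// Pairwise cosine similarities between L2-normalized CLIP embeddings,
// computed with vDSP -- the kind of kernel a t-SNE projection builds on.
func similarityMatrix(_ embeddings: [[Float]]) -> [[Float]] {
    let n = embeddings.count
    var matrix = [[Float]](repeating: [Float](repeating: 0, count: n), count: n)
    for i in 0..<n {
        for j in i..<n {
            let s = vDSP.dot(embeddings[i], embeddings[j])
            matrix[i][j] = s
            matrix[j][i] = s
        }
    }
    return matrix
}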

See it in action

33 pre-configured categories with tuned distance thresholds, plus semantic search

Threshold | Categories
1.25      | cats, guitars, motorbikes, dreadlocks
1.28      | animals, boxing, cannabis, instruments, electronics, books, tickets, tattoos, food, portraits
1.30      | landscapes, architecture, furniture, hairstyles, transport
1.32      | cooking, sports, restaurants, travel, festivals, fashion, art, programming, golang
1.35      | events, datascience, hacking
1.38      | memes, screenshots
Threshold Range | Strictness  | Typical Use
1.20 - 1.25     | Very Strict | Only very close semantic matches
1.25 - 1.28     | Strict      | Specific visual objects
1.28 - 1.30     | Moderate    | Balanced precision and recall
1.30 - 1.35     | Relaxed     | Broader categories
1.35 - 1.38     | Loose       | Abstract or diverse categories
Categories are fully customizable via YAML configuration or the native app's category editor
ImageSieve

Ready to organize your photo library?

Stop manually sorting thousands of images. Let CLIP AI understand your photos and categorize them with natural language -- on-device, private, and fast.

Native macOS & iOS App
On-Device CoreML Inference
33 Pre-configured Categories