Image Classification with Hugging Face Vision Transformers
Author: Venkata Sudhakar
Image classification assigns a label to an image from a fixed set of categories. ShopMax India uses it to tag uploaded product photos automatically by category (TV, laptop, headphone, refrigerator), so sellers can list products faster and customers can find visually similar products without the operations team labelling images by hand.
Hugging Face supports Vision Transformer (ViT) models through the image-classification pipeline. ViT splits an image into a sequence of fixed-size patches and applies transformer self-attention across them, reaching accuracy comparable to strong CNNs on ImageNet when pre-trained on enough data. The pipeline accepts local file paths, URLs, or PIL Image objects and returns the top-k predicted labels with confidence scores.
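To make the patch idea concrete, the arithmetic can be sketched in a few lines. The 224-pixel input size and 16-pixel patch size are assumptions taken from the common ViT-Base/16 configuration, not from this article, and the function names are illustrative.

```python
def patch_grid(image_size, patch_size):
    """Return (patches per side, total patches) for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side, per_side * per_side


def patch_vector_length(patch_size, channels=3):
    """Length of one flattened patch before the linear projection."""
    return patch_size * patch_size * channels


# Assumed ViT-Base/16 settings: 224x224 RGB input, 16x16 patches.
per_side, num_patches = patch_grid(224, 16)
print(f"{per_side} x {per_side} grid = {num_patches} patches")
print(f"each patch flattens to {patch_vector_length(16)} values")
```

Under these assumptions the transformer sees a sequence of 196 patch tokens (plus one classification token), each starting as a 768-value vector.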
The example below loads a ViT model and classifies product images for ShopMax India's electronics catalogue, returning the top 3 predicted categories for each image.
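A minimal sketch of such an example follows. The checkpoint name `google/vit-base-patch16-224`, the image paths, and the helper names are illustrative assumptions rather than details given in this article; only the `pipeline("image-classification")` call and its `top_k` argument come from the Transformers library itself.

```python
# Hypothetical SKU -> image path mapping; replace with real catalogue files.
PRODUCT_IMAGES = {
    "SKU-TV-001": "images/sku-tv-001.jpg",
    "SKU-LAPTOP-042": "images/sku-laptop-042.jpg",
}


def format_predictions(sku, predictions):
    """Render pipeline output (a list of {'label', 'score'} dicts) as text."""
    lines = [f"SKU: {sku}"]
    for pred in predictions:
        lines.append(f"{pred['label']}: {pred['score'] * 100:.1f}%")
    return "\n".join(lines)


def classify_products(images, model_name="google/vit-base-patch16-224", top_k=3):
    """Classify each image and return {sku: top-k predictions}."""
    # Imported here so the formatting helper works without transformers installed.
    from transformers import pipeline

    classifier = pipeline("image-classification", model=model_name)
    return {sku: classifier(path, top_k=top_k) for sku, path in images.items()}


def main():
    for sku, predictions in classify_products(PRODUCT_IMAGES).items():
        print(format_predictions(sku, predictions))
```

Calling `main()` prints one block per SKU; the exact labels and scores depend on the checkpoint and the images supplied.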
It gives the following output:
SKU: SKU-TV-001
lemon: 21.3%
orange: 18.7%
banana: 9.2%
SKU: SKU-LAPTOP-042
ant: 98.6%
black widow: 0.4%
bee: 0.2%
The labels in the output above come from the pretrained checkpoint's own label set, not from ShopMax India's catalogue. To get categories like TV, Laptop, Headphone, and Refrigerator, fine-tune the ViT model on your own labelled product images using the Hugging Face Trainer API. Use AutoImageProcessor (which replaces the deprecated AutoFeatureExtractor) so that preprocessing is identical during training and inference. Cache the model locally after the first download to avoid repeated network calls in production.
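The setup for such a fine-tuning run might look like the sketch below. It assumes the `google/vit-base-patch16-224` checkpoint and a hypothetical local directory `./shopmax-vit`; the helper names are illustrative. It uses `AutoImageProcessor`, the current replacement for the deprecated `AutoFeatureExtractor`, and leaves out the dataset and `Trainer` configuration, which depend on how the labelled images are stored.

```python
CATEGORIES = ["TV", "Laptop", "Headphone", "Refrigerator"]


def build_label_maps(categories):
    """Build the id <-> label dictionaries the model config expects."""
    id2label = {i: name for i, name in enumerate(categories)}
    label2id = {name: i for i, name in id2label.items()}
    return id2label, label2id


def load_for_finetuning(checkpoint="google/vit-base-patch16-224", categories=CATEGORIES):
    """Load the image processor and a model with a fresh classification head."""
    from transformers import AutoImageProcessor, AutoModelForImageClassification

    id2label, label2id = build_label_maps(categories)
    processor = AutoImageProcessor.from_pretrained(checkpoint)
    model = AutoModelForImageClassification.from_pretrained(
        checkpoint,
        num_labels=len(categories),
        id2label=id2label,
        label2id=label2id,
        # The pretrained head has a different number of classes; re-initialise it.
        ignore_mismatched_sizes=True,
    )
    return processor, model


def save_locally(processor, model, directory="./shopmax-vit"):
    """Write both artefacts to disk so later loads need no network access."""
    processor.save_pretrained(directory)
    model.save_pretrained(directory)
```

After training, passing the saved directory to `from_pretrained` (or to `pipeline(..., model=directory)`) loads the model from disk instead of the Hub.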