Product
June 20, 2024

Train fast computer vision models with slower multimodal AI

Written by
Matt Vella
Developer Relations

AI models with billions of parameters are very capable, but not always suitable for real-time use. However, they can reduce human effort by automatically supervising the annotation of fast, purpose-built models.

If you’ve ever built a computer vision model, such as an object detection or image classification model, you know that a large amount of effort goes into supervision: a human spending hours or days drawing bounding boxes and adding labels to annotate images for training. The end result is a fast machine learning model that can be used for real-time detection and automation.

However, the requirement of human involvement slows down not only the initial training of a model, but also any iterative fine-tuning or improvement of the model as more data is collected.

The rise of VLMs

In the past year, we’ve seen the rise of multimodal LLMs (sometimes referred to as VLMs, or Vision Language Models): AI models with billions of parameters that can look at images in sophisticated ways.

While traditional computer vision models are typically trained to identify base classes like “person” or “dog”, VLMs can use their language abilities to identify things like “person wearing glasses” or “shaggy brown dog” with specificity, without additional training.

However, whether you use a very large cloud VLM like ChatGPT or a more compact but capable local VLM like Moondream, these inferences take seconds and are therefore too slow for many real-time scenarios.

If VLMs are currently “smart enough” to perform complex visual inference, but too slow for real-time tasks, can they help in other ways? We are now seeing that the answer is yes. These slower but more capable models can be used to replace humans in supervisory annotation tasks.  

A “two-phase” auto-labeling approach

A two-phase auto-labeling approach

Zero-shot models like Grounding DINO and Segment Anything have been leveraged by projects like autodistill because they can identify many common classes for annotation, despite being fairly slow on their own. This makes it possible to create fast custom computer vision models without human supervision. However, these models have a tradeoff: while they can identify many simple classes, they lack the ability of full VLMs to perform contextual inference.
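As a rough sketch of how this works in practice, autodistill pairs a zero-shot “base model” with an ontology that maps natural-language prompts to class names, then labels an entire folder of images. The folder path and prompts below are illustrative assumptions, not part of any particular project:

```python
# Minimal sketch of zero-shot auto-labeling, assuming the autodistill and
# autodistill-grounded-sam packages are installed.
from autodistill.detection import CaptionOntology
from autodistill_grounded_sam import GroundedSAM

# Map natural-language prompts (keys) to the class names the dataset will use (values).
ontology = CaptionOntology({
    "person": "person",
    "glasses": "glasses",
})

# Grounded SAM (Grounding DINO + Segment Anything) detects each prompt in every
# image in the folder and writes an annotated dataset to disk.
base_model = GroundedSAM(ontology=ontology)
base_model.label("./unlabeled_images", extension=".jpg")
```

The resulting dataset can then be used to train a small, fast model (autodistill calls this the “target model”), which is the same pattern the two-phase approach below builds on.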

Let’s pretend we want to use computer vision on electronic signs along city streets to market eyewear products to passersby who wear glasses. We need it to be responsive in real time, and for privacy reasons we can’t send images to the cloud for inference; it needs to happen on-device. Therefore, we will train a fast, local model to differentiate between people wearing glasses and people not wearing glasses.

Phrase grounding involves associating specific regions of an image with corresponding phrases, for example, "person" and "glasses"

Using a grounding model like Grounding DINO gets us part of the way there. It uses a method called phrase grounding, which associates specific regions of an image with corresponding phrases in a textual description; this adds to the computational load, making the model relatively slow. When asked to identify “person wearing glasses”, it correctly identifies “person” and “glasses”.
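To make that concrete, here is a minimal sketch of open-vocabulary detection using the Hugging Face transformers port of Grounding DINO. The checkpoint name, image path, and thresholds are illustrative choices, not a recommendation:

```python
# Sketch of phrase-grounded detection with Grounding DINO via Hugging Face transformers.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

model_id = "IDEA-Research/grounding-dino-tiny"  # illustrative checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id)

image = Image.open("street_scene.jpg")  # hypothetical image
text = "person. glasses."  # prompts are lowercase and period-separated

inputs = processor(images=image, text=text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw outputs into scored, labeled boxes in pixel coordinates.
results = processor.post_process_grounded_object_detection(
    outputs,
    inputs.input_ids,
    box_threshold=0.35,
    text_threshold=0.25,
    target_sizes=[image.size[::-1]],
)[0]

for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(f"{label}: {score:.2f} at {box.tolist()}")
```

Both “person” and “glasses” come back as separate boxes, which raises the question of how those boxes relate to each other.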

We could consider doing some calculations to determine whether “glasses” are detected within the same region as the “person”, but this can get complicated, especially if we were instead attempting to determine whether someone is wearing glasses on their face (not hanging around their neck or perched on top of their head).
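For illustration only, here is roughly what such a geometric check might look like with (x1, y1, x2, y2) boxes. The “top quarter of the person box” face heuristic is an assumption, and it shows how quickly this kind of logic becomes fragile:

```python
# Sketch of a purely geometric check for "glasses inside the person box".
def box_inside(inner, outer, tolerance=0.9):
    """Return True if at least `tolerance` of inner's area lies within outer.
    Boxes are (x1, y1, x2, y2) in pixel coordinates."""
    ix1, iy1 = max(inner[0], outer[0]), max(inner[1], outer[1])
    ix2, iy2 = min(inner[2], outer[2]), min(inner[3], outer[3])
    intersection = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    inner_area = (inner[2] - inner[0]) * (inner[3] - inner[1])
    return inner_area > 0 and intersection / inner_area >= tolerance


def glasses_on_face(person_box, glasses_box):
    # Crude heuristic: assume the face is in the top quarter of the person box,
    # so glasses hanging around the neck or held in a hand are rejected.
    x1, y1, x2, y2 = person_box
    face_region = (x1, y1, x2, y1 + 0.25 * (y2 - y1))
    return box_inside(glasses_box, face_region)
```

Heuristics like this break down with unusual poses, partial occlusion, or glasses pushed up onto the forehead, which is exactly where a VLM can help.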

Using a VLM as a "second phase" beyond the grounding model can more accurately determine whether a person is wearing glasses

Instead, we can leverage a VLM as a “second phase”, as it can more accurately determine whether the “person” that Grounding DINO detected is in fact wearing glasses.

If both our grounding model and VLM affirm that there is a person wearing glasses, this information can be used to automatically add an annotated image to our dataset. Once we’ve collected enough annotated images for “person with glasses” and “person without glasses”, a human can optionally review the annotated data, and the model can be trained.
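As a sketch of that second phase, the detected person region can be cropped and sent to a VLM with a simple yes/no question. The example below uses the OpenAI Python client purely for illustration; the model name, prompt wording, and helper function are assumptions, and a local VLM such as Moondream could fill the same role:

```python
# Sketch of a "second phase" VLM check on a cropped person detection.
import base64
from io import BytesIO

from openai import OpenAI
from PIL import Image

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def person_wears_glasses(image_path, person_box):
    """Crop the detected person and ask a vision-capable model to confirm the glasses."""
    crop = Image.open(image_path).crop(person_box)
    buffer = BytesIO()
    crop.save(buffer, format="JPEG")
    encoded = base64.b64encode(buffer.getvalue()).decode()

    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative; any vision-capable model could be used
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Is this person wearing glasses on their face? Answer only yes or no."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{encoded}"}},
            ],
        }],
    )
    return response.choices[0].message.content.strip().lower().startswith("yes")
```

When the VLM’s answer agrees with the grounding model’s detection, the image and its bounding boxes can be written into the “person with glasses” dataset automatically; disagreements can be dropped or flagged for human review.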

A two-phase process for using a grounding model and VLM together

Many of the models that can be used are open source, and you can build your own collection and training pipelines to try VLM-based auto-annotation with them in the programming language of your choice. There are also platforms like Viam that offer data and ML solutions to streamline this process.

For example, Viam lets you set up data collection and filtering into a cloud-based dataset without writing code, and also offers in-cloud model training. Models like Grounding DINO, ChatGPT, and Moondream can be configured as services, and a service called “Auto label filter” leverages the selected models to perform supervision in the “two phases” described above. Roboflow is another platform that offers an Auto-Label product that uses grounding models to create annotations.

Regardless of how you choose to implement these technologies, the result will be a fully automated training pipeline that substantially reduces the time spent on machine learning training tasks. If you want to get started today, check out this repository, which wraps all of the required functionality - you can use it for free with a Viam machine you create in just a few clicks.

Correctly identified 'person with glasses' results shown in the Viam app
