August 20, 2024

YOLO model for real-time object detection: A full guide

Written by
Sara Gharnit
Senior Content Marketer

Imagine watching a fast-paced sports game or a busy street and being able to instantly identify and track every moving object with just a single glance. That’s YOLO in action—processing images and videos rapidly to detect objects in real-time.

But what exactly is a YOLO model, and why is it such a big deal?

In this guide, I’ll take you through the ins and outs of YOLO, from its origins and evolution to how you can use these models yourself.

What is object detection?

This image compares a photo without object detection to the same photo with object detection. The version with object detection has bounding boxes around the people and flowers found in the image.
Object detection is a way of localizing where specific objects are within an image. In this example, you can see people and flowers being detected.

To understand YOLO models and how they work, you need to understand object detection. Object detection involves identifying and locating objects within an image.

This is typically done using bounding boxes—regions within an image defined by x and y coordinates. These boxes are visualized as rectangular borders that highlight and approximate the location of each object within the scene.
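As a rough illustration, a bounding box can be stored as two corner points. The sketch below assumes pixel coordinates with the origin at the top-left of the image; the `BoundingBox` class and its field names are made up for this example and do not come from any particular library.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class BoundingBox:
    """Axis-aligned box in pixel coordinates (origin at the top-left corner)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    @property
    def width(self) -> float:
        return self.x_max - self.x_min

    @property
    def height(self) -> float:
        return self.y_max - self.y_min

    @property
    def center(self) -> Tuple[float, float]:
        # Midpoint of the box, used later to decide which grid cell owns it.
        return ((self.x_min + self.x_max) / 2, (self.y_min + self.y_max) / 2)

    @property
    def area(self) -> float:
        # Degenerate (inverted) boxes are treated as having zero area.
        return max(0.0, self.width) * max(0.0, self.height)


# Hypothetical box around a person in a street scene.
person = BoundingBox(x_min=120, y_min=80, x_max=220, y_max=380)
print(person.width, person.height, person.center)
```

Some tools store boxes as a center point plus width and height instead; the two forms carry the same information.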

This image shows how object detection looks on a busy street near Grand Central in NYC. It has bounding boxes in teal around the people and bounding boxes in a bright pink around cars.
Showing how object detection is applied to a busy street with Viam’s app, using bounding boxes to identify the designated objects of people and cars.

There are numerous algorithms and models available to accomplish this, each with varying levels of accuracy, speed, and complexity.

This image shows how Haar-like features are applied to the eye region and the bridge of the nose to detect faces in images.
An example of how Haar-like features work for algorithms, like the Viola-Jones Detector. 

These methods range from traditional approaches, such as the Viola-Jones Detector, which uses Haar-like features to detect certain patterns, to more advanced deep learning models.

Among the most popular and effective of these deep learning models for real-time detection is YOLO.

What is the YOLO model?

YOLO stands for "You Only Look Once." As the name suggests, this model type only needs to “glance” at an image once, using a single pass through the network to identify and locate objects. This approach differs from two-stage detectors like R-CNN, which first propose candidate regions and then classify each one.

YOLO models have an architecture that consists of a backbone, neck, and head.

A diagram showing the architecture of single-stage object detectors.
Diagram showing how single-stage detectors perform object detection in a single pass through the network. 

These models identify the bounding box coordinates, confidence scores, and class probabilities all at once.

A look at how YOLO object detection models work in a single pass through the network.

How does the YOLO object detection model work?

Imagine you’re a retail store owner who wants to monitor how many people are coming in and out of the store at all times, while also using object detection for security purposes. You decide on using a YOLO model, because it’s effective for detecting people in fast-moving videos. 

Here’s how the YOLO model would work behind the scenes: 

1 - Image division into a grid

The first step of the YOLO model: overlaying the image with a grid.

The input image is divided into a grid, with each grid cell responsible for detecting objects whose center falls within it.
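To make the "responsible cell" idea concrete, here is a minimal sketch that maps an object's center to a grid cell. It assumes a 448×448 image and a 7×7 grid (the sizes used in the original YOLO paper); the function name `responsible_cell` is just for illustration.

```python
def responsible_cell(center_x, center_y, image_width, image_height, grid_size):
    """Return the (row, col) of the grid cell that contains the object's center."""
    col = int(center_x / image_width * grid_size)
    row = int(center_y / image_height * grid_size)
    # Clamp so a center lying exactly on the right or bottom edge stays in range.
    col = min(col, grid_size - 1)
    row = min(row, grid_size - 1)
    return row, col


# A shopper whose bounding box is centered at pixel (170, 230).
cell = responsible_cell(170, 230, image_width=448, image_height=448, grid_size=7)
print(cell)  # (3, 2)
```

Each cell here covers 64×64 pixels, so the shopper's center falls in the fourth row and third column (counting from zero: row 3, column 2).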

2 - Bounding box prediction

Each grid cell predicts a set number of bounding boxes (usually 2 or more). 
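For a sense of scale, the sketch below counts how many numbers the network outputs under the layout of the original YOLOv1 paper: each box contributes five values (x, y, width, height, confidence), and each cell adds one probability per class. Later YOLO versions arrange their outputs differently, so treat this as an illustration of the first version only.

```python
def yolo_v1_output_shape(grid_size, boxes_per_cell, num_classes):
    """Shape of the YOLOv1-style prediction tensor: S x S x (B * 5 + C)."""
    values_per_cell = boxes_per_cell * 5 + num_classes
    return (grid_size, grid_size, values_per_cell)


# YOLOv1 settings: 7x7 grid, 2 boxes per cell, 20 classes.
shape = yolo_v1_output_shape(grid_size=7, boxes_per_cell=2, num_classes=20)
print(shape)  # (7, 7, 30)
```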

From there, a confidence score is calculated by multiplying the objectness score by the class probability. The result reflects how certain the model is that a particular bounding box contains an object of a given class.

  • Class probability (P(class | object)): The probability that the detected object belongs to a particular class, given that an object is present in the bounding box.
  • Objectness score (P(object)): The probability that an object is present in the predicted bounding box.

This image shows how the class probability is found within an image. The grid cells where people (what the model is looking for) are located are considered significant.
Finding the class probability for the security camera image.
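The multiplication described above can be sketched in a few lines. The class names and probabilities below are invented for the retail example, and the function name `class_confidences` is hypothetical.

```python
def class_confidences(objectness, class_probabilities):
    """Multiply P(object) by each P(class | object) to get per-class confidence."""
    return {
        label: objectness * probability
        for label, probability in class_probabilities.items()
    }


# Made-up predictions for one bounding box near the store entrance.
scores = class_confidences(
    objectness=0.9,
    class_probabilities={"person": 0.8, "cart": 0.15, "dog": 0.05},
)
best_label = max(scores, key=scores.get)
print(best_label, round(scores[best_label], 2))  # person 0.72
```

A box with a high objectness score but no dominant class still ends up with a modest confidence, which is why both factors matter.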

3 - Intersection over Union (IoU)

This image shows how the IoU (Intersection over Union) is found within a specific area of the image.
Finding the IoU of a specific portion of the security camera image. 

Ranging from 0 to 1, the IoU measures the overlap between the predicted bounding box and the ground truth (actual bounding box).

  • If the IoU = 0: no overlap between the predicted and ground truth boxes.
  • If the IoU = 1: indicates a perfect overlap.

When evaluating predictions, if the IoU is above a specific threshold (e.g., 0.5), the detection is considered a true positive (correct). If the IoU is below that threshold, it’s considered a false positive (incorrect).
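IoU is simple to compute by hand. The sketch below assumes boxes are given as `(x_min, y_min, x_max, y_max)` tuples in pixels; the example coordinates are invented.

```python
def iou(box_a, box_b):
    """Intersection over Union for two (x_min, y_min, x_max, y_max) boxes."""
    # Corners of the overlapping rectangle, if there is one.
    inter_x_min = max(box_a[0], box_b[0])
    inter_y_min = max(box_a[1], box_b[1])
    inter_x_max = min(box_a[2], box_b[2])
    inter_y_max = min(box_a[3], box_b[3])

    # Width and height are clamped at zero when the boxes do not overlap.
    inter_area = max(0.0, inter_x_max - inter_x_min) * max(0.0, inter_y_max - inter_y_min)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union_area = area_a + area_b - inter_area

    return inter_area / union_area if union_area > 0 else 0.0


predicted = (10, 10, 110, 110)
ground_truth = (20, 20, 120, 120)
score = iou(predicted, ground_truth)
print(round(score, 2), score >= 0.5)  # 0.68 True
```

Here the two 100×100 boxes are offset by 10 pixels in each direction, which is enough overlap to clear a 0.5 threshold.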

4 - Non-Maximum Suppression (NMS)

After predictions are made, the model applies NMS to filter out redundant, overlapping bounding boxes that point to the same object, keeping the box with the highest confidence score.

The final output consists of the bounding boxes with the highest confidence scores, along with the predicted class labels and their probabilities, effectively identifying the objects within the image.
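A common greedy form of NMS can be sketched as follows. This is a simplified, single-class version written for illustration, not the exact routine any specific YOLO release uses; production implementations usually run per class and rely on optimized library code. The detections and the 0.5 threshold are assumptions.

```python
def iou(box_a, box_b):
    """Intersection over Union for two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter_area = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union_area = area_a + area_b - inter_area
    return inter_area / union_area if union_area > 0 else 0.0


def non_max_suppression(detections, iou_threshold=0.5):
    """Greedy NMS over a list of (box, confidence) pairs."""
    remaining = sorted(detections, key=lambda det: det[1], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)  # highest-confidence box still in play
        kept.append(best)
        # Drop every box that overlaps the kept box too much.
        remaining = [
            det for det in remaining if iou(best[0], det[0]) < iou_threshold
        ]
    return kept


detections = [
    ((10, 10, 110, 110), 0.90),    # shopper, best box
    ((20, 20, 120, 120), 0.75),    # same shopper, slightly shifted box
    ((300, 300, 380, 380), 0.60),  # a different shopper
]
kept = non_max_suppression(detections)
print(len(kept))  # 2
```

The second box is discarded because it overlaps the first by more than the threshold, while the third survives because it does not overlap at all.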

What are the advantages and disadvantages of YOLO models?

Why is YOLO so popular?

The key advantage of YOLO is its speed. Since it only requires one pass to detect objects, it can process images or video streams more quickly than multi-stage models.

This makes it effective for real-time applications where speed is critical, like traffic monitoring, sports analytics, and surveillance.

Let’s say you’re in an autonomous vehicle driving down a busy street full of cars and pedestrians. In a single pass, the model identifies the bounding boxes and their confidence scores (its certainty that each bounding box contains an object) in real time.

This means that if there are any sudden changes, the model will detect and localize them, allowing the vehicle to make any necessary adjustments in a time-sensitive manner.

Outside of speed, YOLO is well loved because of its:

  • Efficiency: It requires less compute power than models that process an image in multiple stages.
  • High accuracy: While the accuracy may not be as strong as some two-stage algorithms, like R-CNN, YOLO is still found to deliver high accuracy for localizing objects. 
  • Open-source nature: A strong and active community drives continuous updates and enhancements to keep YOLO aligned with the latest advancements in object detection.
  • Hardware compatibility: It’s best suited for GPUs to handle heavy processing, but lighter versions can also run on powerful CPUs and certain single-board computers.
  • User-friendliness: With strong documentation, an active community, and a large set of pre-trained models, the YOLO model is considered easier to use than other object detection models.

What are the disadvantages of YOLO?

While YOLO is fast, this speed can come at the cost of accuracy, particularly in terms of localization.

Let’s say you were applying a YOLO model to detect faces in casinos, where high speed is crucial for preventing cheating and ensuring regulatory compliance. The bounding boxes might not always be perfectly tight, sometimes missing part of the face or including extra space.

An image showing people at a casino, with the YOLO model identifying faces. This image illustrates the inaccuracy that can sometimes happen with single-stage detectors. For example, the bounding boxes around some faces are too small or too large compared to where the actual face is.
A look at how a YOLO model might be inaccurate at times.

This happens because YOLO doesn't go through multiple stages of refining and comparing similar boxes like a region proposal network would, potentially leading to minor localization inaccuracies.

Skipping multiple passes also leads YOLO models to struggle with:

  • Small objects: Its grid-based detection approach might not offer enough resolution for smaller objects, making them more difficult to detect. 
  • Detailed scenes: Complex environments or scenes with overlapping objects make it harder for YOLO to pinpoint objects accurately.
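The small-object limitation can be illustrated with the grid from earlier. The sketch below assumes the original single-grid formulation (a 7×7 grid over a 448×448 image, so each cell covers 64×64 pixels) and uses invented object centers; later YOLO versions predict at several scales, which reduces this problem.

```python
from collections import defaultdict


def crowded_cells(centers, image_size, grid_size):
    """Return grid cells that contain the center of more than one object."""
    cells = defaultdict(list)
    for name, (center_x, center_y) in centers.items():
        col = min(int(center_x / image_size * grid_size), grid_size - 1)
        row = min(int(center_y / image_size * grid_size), grid_size - 1)
        cells[(row, col)].append(name)
    return {cell: names for cell, names in cells.items() if len(names) > 1}


# Two small birds close together and one car elsewhere in the frame.
centers = {"bird_1": (70, 70), "bird_2": (100, 90), "car": (300, 330)}
crowded = crowded_cells(centers, image_size=448, grid_size=7)
print(crowded)  # {(1, 1): ['bird_1', 'bird_2']}
```

Both birds land in the same cell, so they compete for that cell's limited number of box predictions, while the car has a cell to itself.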

What are some examples of the YOLO model in use?

Autonomous vehicles

In applications such as self-driving cars or delivery robots, YOLO models can quickly detect and identify obstacles, pedestrians, or other vehicles. This fast detection is essential for making real-time navigation decisions to avoid collisions.

A YOLO model being used with Viam to detect cars. This image shows cars driving on a mountain road, with a bounding box around each of the 5 cars the model has located.
A YOLO model being used on Viam to detect cars.

Sports analysis

YOLO models can track fast-moving objects, like hockey pucks or soccer balls, in sports. 

A YOLO model applied with Viam to detect where a soccer ball and players are located on a field.
A YOLO model being used on Viam to detect players and soccer balls.

This is useful for analyzing player performance, reviewing game footage, or even enhancing the viewing experience for fans.

Industrial robotics

YOLO can be used in manufacturing settings where robots need to detect and interact with objects on an assembly line. Its real-time capabilities ensure that robots can quickly and accurately perform tasks such as sorting, picking, and placing objects.

This image shows an assembly line of donuts, with the YOLO model detecting each donut.
A YOLO model being used on Viam to detect donuts on an assembly line.

Security and surveillance

YOLO models are ideal for real-time object detection in environments like casinos, where monitoring for suspicious activity or identifying known individuals is critical. The speed of YOLO allows for immediate recognition and response.

Smart home systems

YOLO models can be integrated into smart home devices, such as security cameras or interactive assistants, enabling them to detect and respond to various objects or people within their field of view.

Healthcare

In medical imaging, YOLO can assist in identifying and localizing features such as tumors or other anomalies in real-time, providing doctors with immediate feedback during procedures. 

What are the different YOLO model versions?

YOLO has changed significantly since it was first introduced in 2016. It has gone through multiple iterations from different teams, with Ultralytics now maintaining some of the most widely used versions. Each version brings improvements in speed, accuracy, and usability.

This timeline shows the evolution of YOLO models from YOLOv1 to YOLOv10.

Here’s a quick breakdown of each model version: 

  • YOLOv1: Introduced the concept of single-pass object detection, processing images at high speeds but with some limitations in precision.
  • YOLOv2 (YOLO9000): Enhanced the model with higher resolution processing and batch normalization, improving the model’s ability to capture details and reduce variability. While YOLOv1 had 20 categories, YOLOv2 could detect 9,000 categories. 
  • YOLOv3: Further improved small object detection with multi-scale prediction (detecting the objects at three different scales) and a deeper Darknet-53 backbone, making the bounding box predictions more accurate.
  • YOLOv4: Focused on efficiency of the model, allowing object detection training on common GPUs with optimized training techniques and the CSPDarknet backbone.
  • YOLOv5: This model series, the first developed by Ultralytics, emphasized accessibility, with easy-to-use tools, Python and PyTorch support, and model scaling options (models that cater to different trade-offs between speed and accuracy) for various applications.
  • YOLOv6: Introduced the EfficientRep Backbone and Rep-PAN Neck, making it more lightweight and efficient, especially for devices with limited compute power.
  • YOLOv7: Refined the architecture further with advanced layer aggregation techniques, making it one of the top-performing models in both speed and accuracy at its release.
  • YOLOv8: Revolutionized the model with architectural changes including anchor-free detection, mosaic data augmentation, and the C2f module, enhancing both accuracy and ease of use.
  • YOLOv9: Brought new model scaling techniques and architectural changes to improve efficiency and flexibility, catering to diverse needs in speed and accuracy.
  • YOLOv10: Continued to build on YOLOv9's advancements, further enhancing customization options and optimizing performance for different use cases.
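To give a feel for the multi-scale prediction mentioned under YOLOv3, the sketch below computes the grid size at each of the three detection scales. It assumes a square input and the downsampling factors (strides) of 8, 16, and 32 that YOLOv3 uses; the function name is hypothetical.

```python
def grid_sizes(input_size, strides=(8, 16, 32)):
    """Grid width/height at each detection scale for a square input image."""
    return [input_size // stride for stride in strides]


print(grid_sizes(416))  # [52, 26, 13]
```

The fine 52×52 grid helps with small objects, while the coarse 13×13 grid handles large ones.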

Did you know that in a few minutes you can deploy YOLOv5 or YOLOv8 models onto your devices with Viam? See the section “How can you apply YOLO models to your devices?” below.

Which version of the YOLO model is best?

YOLOv9 and YOLOv10 have not been tested as rigorously as YOLOv8 and may not yet be as stable. That makes YOLOv8 the more reliable choice for most users.

An infographic displaying YOLOv8’s usage in 2023. (source)

Additionally, the YOLOv8 model offers high accuracy and speed, has a strong community, and is considered easy to use thanks to an intuitive API, pre-trained models, and out-of-the-box support for common tasks.

How can you apply YOLO models to your devices?

Using Viam, deploying YOLOv5 or YOLOv8 models is simple. This is all thanks to the modules found in the Viam Registry, which work with open-source YOLO models from repositories like Hugging Face as well as models you have built yourself.

A screenshot of the Hugging Face interface when YOLOv8 models are searched for.

Adding Hugging Face YOLO models

Once your device is set up in Viam, all you have to do is:

  1. Add the YOLOv5 or YOLOv8 Vision Service module to your configuration.
  2. Head to Hugging Face and find the model of your choice.
  3. Enter the model location in the JSON. This tells the YOLO module which model to use. For example, the following points the Vision Service to the hard hat detection model in the Hugging Face repository:

     {
       "model_location": "keremberke/yolov8n-hard-hat-detection"
     }

  4. Save the configuration.

Adding other YOLOv5 or YOLOv8 models

Remember, the YOLOv5 and YOLOv8 modules enable you to use any compatible YOLO model with your Viam machines. That means that even if you built the YOLO model yourself or downloaded it from another repository, the process is quite simple.

  1. As before, add the YOLOv5 or YOLOv8 Vision Service module to your configuration.
  2. Point to the location of the applicable YOLO file on your device in the JSON.
  3. Save the configuration.
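As a rough guide to what that JSON might contain, the sketch below builds the attributes for both cases. It assumes the same `model_location` key shown for Hugging Face models also accepts a file path, which you should confirm in the module's documentation. The local path is a made-up placeholder, and `vision_attributes` is a helper written only for this example.

```python
import json


def vision_attributes(model_location):
    """Build the JSON attributes that tell the YOLO module which model to use."""
    return json.dumps({"model_location": model_location}, indent=2)


# A model hosted on Hugging Face (repository ID taken from the example above).
hugging_face_config = vision_attributes("keremberke/yolov8n-hard-hat-detection")

# A model file stored on the device. This path is a hypothetical placeholder.
local_config = vision_attributes("/path/to/your/model.pt")

print(local_config)
```

Generating the JSON from code like this is optional; typing the same attributes directly into the configuration works just as well.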

Tips for deploying YOLO models

If you're considering YOLO for a variety of projects, keep these deployment tips in mind:

  • Search for your ideal model on Hugging Face first: One of the great aspects of Hugging Face is the strong community behind it. There are many contributed open-source YOLO models trained on many categories—from airplanes to forklifts and even enemies in video games.
  • Look for YOLO models elsewhere: If you can’t find exactly what you’re looking for on Hugging Face, turn to other repositories, like Ultralytics, Kaggle, or Model Zoo.
  • Explore projects using YOLO models: If you’re unsure of what to build using YOLO models, get inspiration from others in the community.
  • Use YOLO models for image classification: There are some YOLO models that can also be used for image classification, assigning class labels to the images with a confidence score. Consider using them for projects where the location of the object is not needed.

A sign language Hugging Face YOLOv8 model being used with Viam’s YOLOv8 module by Coders Cafe.

Get started with YOLO models today

As you’ve seen, YOLO models are at the forefront of real-time object detection, offering great speed and versatility for a number of use cases, from sign-language detection to safety in vehicles and more.

Ready to get hands-on with YOLO and start building your own object detection solutions? Viam makes it easier than ever. Check out our tutorial on deploying YOLOv5 and YOLOv8 models using Viam's platform.

To learn more about object detection and its many applications, don't miss our comprehensive object detection guide.

Technically reviewed by: Khari Jarrett, Matt Vella, Nick Hehr
