YOLOv7

YOLOv7 is the latest iteration of the You Only Look Once family of object detectors. The theoretical background can be found in the paper YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors by Chien-Yao Wang, Alexey Bochkovskiy, and Hong-Yuan Mark Liao. The key point is that, according to the authors, it surpasses all known object detectors in both speed and accuracy. YOLOv7 is capable of object detection and segmentation.

Custom dataset

To train YOLOv7 on a custom dataset, we need the YOLOv7 repository (the u7 branch for segmentation), a dataset in the correct format, a dataset.yaml file, and a yolov7-seg.yaml configuration file.
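
For reference, assuming the official WongKinYiu/yolov7 repository, the segmentation code can be obtained by cloning the u7 branch into the model/ folder, for example:

git clone --branch u7 --single-branch https://github.com/WongKinYiu/yolov7.git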

We will have the following folder architecture:

model/
├─ yolov7/
├─ dataset.yaml
├─ yolov7-seg.yaml 
├─ processed_dataset/
│  ├─ train/
│  │  ├─ images/
│  │  ├─ labels/
│  ├─ val/
│  │  ├─ images/
│  │  ├─ labels/

For segmentation, a label file is associated with each image; it contains, one line per object, the class number followed by the coordinates of the polygon that defines the segmentation mask. The coordinates must be normalized between 0 and 1 (x divided by the image width, y divided by the image height).

class_number x1 y1 x2 y2 x3 y3
class_number x1 y1 x2 y2 x3 y3
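
As a minimal sketch of the normalization step (assuming polygon points are available in pixel coordinates; write_label is just an illustrative helper):

from pathlib import Path

def write_label(txt_path, objects, img_width, img_height):
    # objects: list of (class_number, [(x, y), ...]) polygons in pixel coordinates
    lines = []
    for class_number, polygon in objects:
        coords = []
        for x, y in polygon:
            coords.append(f"{x / img_width:.6f}")   # x normalized by image width
            coords.append(f"{y / img_height:.6f}")  # y normalized by image height
        lines.append(f"{class_number} " + " ".join(coords))
    path = Path(txt_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("\n".join(lines) + "\n")

# example: one object of class 0 in a 640x480 image
write_label("processed_dataset/train/labels/0000.txt",
            [(0, [(120, 80), (400, 90), (380, 300), (110, 280)])],
            img_width=640, img_height=480)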

We create two folders, one for training (processed_dataset/train) and one for validation (processed_dataset/val). Inside each of these folders we again have two folders, images and labels, where we put the images (0000.jpg, 0001.jpg, ...) and the labels (0000.txt, 0001.txt, ...), matched only by file name.
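
As a sketch of how this split can be built (assuming a hypothetical raw/ folder containing matched image/label pairs; the 80/20 ratio is arbitrary):

import random
import shutil
from pathlib import Path

raw = Path("raw")                    # hypothetical folder holding 0000.jpg / 0000.txt pairs
out = Path("processed_dataset")
images = sorted(raw.glob("*.jpg"))
random.seed(0)                       # reproducible split
random.shuffle(images)
split = int(0.8 * len(images))       # 80% train / 20% val

for subset, subset_images in (("train", images[:split]), ("val", images[split:])):
    for kind in ("images", "labels"):
        (out / subset / kind).mkdir(parents=True, exist_ok=True)
    for img in subset_images:
        shutil.copy(img, out / subset / "images" / img.name)
        label = img.with_suffix(".txt")
        shutil.copy(label, out / subset / "labels" / label.name)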

We create the dataset.yaml file that contains the information about the dataset (the relative paths are explained in the training section below), here with a single class:

train: ../../processed_dataset/train
val: ../../processed_dataset/val
nc: 1
names:
  0: cat

Finally, we create the yolov7-seg.yaml file containing the network definition. The main point is to set the class number nc to the correct value.

# YOLOv7

# Parameters
nc: 1  # number of classes
depth_multiple: 1.0  # model depth multiple
width_multiple: 1.0  # layer channel multiple
anchors:
  - [12,16, 19,36, 40,28]  # P3/8
  - [36,75, 76,55, 72,146]  # P4/16
  - [142,110, 192,243, 459,401]  # P5/32

# YOLOv7 backbone
backbone:
  # [from, number, module, args]
  [[-1, 1, Conv, [32, 3, 1]],  # 0
  
   [-1, 1, Conv, [64, 3, 2]],  # 1-P1/2      
   [-1, 1, Conv, [64, 3, 1]],
   
   [-1, 1, Conv, [128, 3, 2]],  # 3-P2/4  
   [-1, 1, Conv, [64, 1, 1]],
   [-2, 1, Conv, [64, 1, 1]],
   [-1, 1, Conv, [64, 3, 1]],
   [-1, 1, Conv, [64, 3, 1]],
   [-1, 1, Conv, [64, 3, 1]],
   [-1, 1, Conv, [64, 3, 1]],
   [[-1, -3, -5, -6], 1, Concat, [1]],
   [-1, 1, Conv, [256, 1, 1]],  # 11
         
   [-1, 1, MP, []],
   [-1, 1, Conv, [128, 1, 1]],
   [-3, 1, Conv, [128, 1, 1]],
   [-1, 1, Conv, [128, 3, 2]],
   [[-1, -3], 1, Concat, [1]],  # 16-P3/8  
   [-1, 1, Conv, [128, 1, 1]],
   [-2, 1, Conv, [128, 1, 1]],
   [-1, 1, Conv, [128, 3, 1]],
   [-1, 1, Conv, [128, 3, 1]],
   [-1, 1, Conv, [128, 3, 1]],
   [-1, 1, Conv, [128, 3, 1]],
   [[-1, -3, -5, -6], 1, Concat, [1]],
   [-1, 1, Conv, [512, 1, 1]],  # 24
         
   [-1, 1, MP, []],
   [-1, 1, Conv, [256, 1, 1]],
   [-3, 1, Conv, [256, 1, 1]],
   [-1, 1, Conv, [256, 3, 2]],
   [[-1, -3], 1, Concat, [1]],  # 29-P4/16  
   [-1, 1, Conv, [256, 1, 1]],
   [-2, 1, Conv, [256, 1, 1]],
   [-1, 1, Conv, [256, 3, 1]],
   [-1, 1, Conv, [256, 3, 1]],
   [-1, 1, Conv, [256, 3, 1]],
   [-1, 1, Conv, [256, 3, 1]],
   [[-1, -3, -5, -6], 1, Concat, [1]],
   [-1, 1, Conv, [1024, 1, 1]],  # 37
         
   [-1, 1, MP, []],
   [-1, 1, Conv, [512, 1, 1]],
   [-3, 1, Conv, [512, 1, 1]],
   [-1, 1, Conv, [512, 3, 2]],
   [[-1, -3], 1, Concat, [1]],  # 42-P5/32  
   [-1, 1, Conv, [256, 1, 1]],
   [-2, 1, Conv, [256, 1, 1]],
   [-1, 1, Conv, [256, 3, 1]],
   [-1, 1, Conv, [256, 3, 1]],
   [-1, 1, Conv, [256, 3, 1]],
   [-1, 1, Conv, [256, 3, 1]],
   [[-1, -3, -5, -6], 1, Concat, [1]],
   [-1, 1, Conv, [1024, 1, 1]],  # 50
  ]

# yolov7 head
head:
  [[-1, 1, SPPCSPC, [512]], # 51
  
   [-1, 1, Conv, [256, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [37, 1, Conv, [256, 1, 1]], # route backbone P4
   [[-1, -2], 1, Concat, [1]],
   
   [-1, 1, Conv, [256, 1, 1]],
   [-2, 1, Conv, [256, 1, 1]],
   [-1, 1, Conv, [128, 3, 1]],
   [-1, 1, Conv, [128, 3, 1]],
   [-1, 1, Conv, [128, 3, 1]],
   [-1, 1, Conv, [128, 3, 1]],
   [[-1, -2, -3, -4, -5, -6], 1, Concat, [1]],
   [-1, 1, Conv, [256, 1, 1]], # 63
   
   [-1, 1, Conv, [128, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [24, 1, Conv, [128, 1, 1]], # route backbone P3
   [[-1, -2], 1, Concat, [1]],
   
   [-1, 1, Conv, [128, 1, 1]],
   [-2, 1, Conv, [128, 1, 1]],
   [-1, 1, Conv, [64, 3, 1]],
   [-1, 1, Conv, [64, 3, 1]],
   [-1, 1, Conv, [64, 3, 1]],
   [-1, 1, Conv, [64, 3, 1]],
   [[-1, -2, -3, -4, -5, -6], 1, Concat, [1]],
   [-1, 1, Conv, [128, 1, 1]], # 75
      
   [-1, 1, MP, []],
   [-1, 1, Conv, [128, 1, 1]],
   [-3, 1, Conv, [128, 1, 1]],
   [-1, 1, Conv, [128, 3, 2]],
   [[-1, -3, 63], 1, Concat, [1]],
   
   [-1, 1, Conv, [256, 1, 1]],
   [-2, 1, Conv, [256, 1, 1]],
   [-1, 1, Conv, [128, 3, 1]],
   [-1, 1, Conv, [128, 3, 1]],
   [-1, 1, Conv, [128, 3, 1]],
   [-1, 1, Conv, [128, 3, 1]],
   [[-1, -2, -3, -4, -5, -6], 1, Concat, [1]],
   [-1, 1, Conv, [256, 1, 1]], # 88
      
   [-1, 1, MP, []],
   [-1, 1, Conv, [256, 1, 1]],
   [-3, 1, Conv, [256, 1, 1]],
   [-1, 1, Conv, [256, 3, 2]],
   [[-1, -3, 51], 1, Concat, [1]],
   
   [-1, 1, Conv, [512, 1, 1]],
   [-2, 1, Conv, [512, 1, 1]],
   [-1, 1, Conv, [256, 3, 1]],
   [-1, 1, Conv, [256, 3, 1]],
   [-1, 1, Conv, [256, 3, 1]],
   [-1, 1, Conv, [256, 3, 1]],
   [[-1, -2, -3, -4, -5, -6], 1, Concat, [1]],
   [-1, 1, Conv, [512, 1, 1]], # 101
   
   [75, 1, Conv, [256, 3, 1]],
   [88, 1, Conv, [512, 3, 1]],
   [101, 1, Conv, [1024, 3, 1]],

   [[102, 103, 104], 1, ISegment, [nc, anchors, 32, 256]],  # Detect(P3, P4, P5)
  ]

Training using Google Colaboratory

Training deep learning models without a GPU can be time-consuming or even impossible. Alternatives exist, such as renting a GPU instance from a provider (OVH, AWS, Google, etc.). A simpler option for tinkering with a small model is Google Colaboratory, which is relatively straightforward to use. The first step is to upload the model folder to your Google Drive. The second step is to create a Colab notebook and select a GPU runtime through Edit, Notebook settings, and choosing GPU.
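
Once the runtime is configured, the GPU assignment can be checked from a notebook cell, for instance with:

!nvidia-smi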

Next, inside the notebook, we mount Google Drive:

from google.colab import drive
drive.mount('/content/drive')

Then install the dependencies for YOLOv7:

import os
os.chdir('/content/drive/MyDrive/model/')
# the requirements file ships inside the seg folder of the u7 branch
!pip install -r yolov7/seg/requirements.txt

To train the model, we use the train.py script provided in the yolov7/seg/segment/ folder. We execute the command from the topmost directory (model/), and the relative paths to the dataset in dataset.yaml are resolved from the YOLOv7 code location rather than from the directory where the command is executed, hence the ../../ prefix.

!python yolov7/seg/segment/train.py --data dataset.yaml --batch 8 --weights '' --cfg yolov7-seg.yaml --epochs 20 --name yolov7-seg_colab_ --hyp yolov7/seg/data/hyps/hyp.scratch-high.yaml

Google Colaboratory enforces limits on GPU usage and session runtime. You can work around these limits by training in several sessions, adding --resume to the training command to continue the training, or by subscribing to a paid Colaboratory plan. The resulting weights can be downloaded from Google Drive (under yolov7/seg/runs) and used for inference.
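
As an illustration, resuming the interrupted run and then running inference with the trained weights could look like the following, assuming the u7 branch follows the YOLOv5-style layout (segment/predict.py script, runs/train-seg output directory); the exact paths depend on your run name:

!python yolov7/seg/segment/train.py --resume

!python yolov7/seg/segment/predict.py --weights yolov7/seg/runs/train-seg/yolov7-seg_colab_/weights/best.pt --source test_image.jpg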

Figure 3 - Segmentation prediction (source image: https://www.vecteezy.com/free-photos)