06. September 2019
5 min

Semantic Segmentation Part 1: DeepLab-V3+

Welcome to the world of Semantic Segmentation! This post is the first part of my blog post series on this topic. The series consists of four posts and gives you an overview of the most commonly used models in the field of Semantic Segmentation. In the last post, I will wrap up the subject with a recommendation for each model. I hope you'll enjoy it!

Our agenda for the Semantic Segmentation series is as follows:

  1. Part 1: Using DeepLab-V3+
  2. Part 2: Training a U-Net
  3. Part 3: Transfer Learning with Mask R-CNN
  4. Part 4: State-of-the-Art (Summary)

Introduction

Semantic Segmentation is the task of assigning a semantic label to every pixel in an image or video. With the goal of real-time segmentation, I applied Google's open-source DeepLab model, which is implemented in TensorFlow. The following improvements have been made to the model since its first version:

  1. DeepLab-V1: Using atrous convolution to control the resolution of feature responses in CNNs. This is also known as dilated convolution; it introduces an additional parameter, the dilation rate, which spaces the kernel's sampling positions over a wider field of view while keeping the same number of weights.
  2. DeepLab-V2: Using atrous spatial pyramid pooling (ASPP), which helps to account for different object scales and improves accuracy.
  3. DeepLab-V3: Adding image-level features to ASPP and applying batch normalization for easier training.
  4. DeepLab-V3+: Extension of DeepLab-V3 with a decoder module that refines the segmentation results, especially along object boundaries.
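To see what the dilation rate from step 1 does, here is a minimal NumPy sketch of a 1-D atrous convolution (the function name `atrous_conv1d` is mine, for illustration): with rate 1 it reduces to an ordinary convolution, while with rate 2 the same three weights cover a field of view of five samples.

```python
import numpy as np

def atrous_conv1d(x, w, rate):
    """1-D atrous (dilated) convolution with 'valid' padding.

    The kernel taps are spaced `rate` samples apart, widening the
    receptive field without adding any weights."""
    k = len(w)
    span = (k - 1) * rate + 1              # effective kernel size
    out = np.empty(len(x) - span + 1)
    for i in range(len(out)):
        out[i] = sum(w[j] * x[i + j * rate] for j in range(k))
    return out

x = np.arange(10, dtype=float)
w = np.array([1.0, 1.0, 1.0])

print(atrous_conv1d(x, w, rate=1))   # ordinary convolution: 3 adjacent samples
print(atrous_conv1d(x, w, rate=2))   # same 3 weights, field of view of 5 samples
```

The key point is that the number of weights (three) stays constant; only the spacing of the sampled pixels changes.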

Architecture of DeepLab-V3+ (from the related paper)

The architecture of the latest version of DeepLab (DeepLab-V3+) consists of two stages:

  • Encoder: In this step, a pre-trained CNN extracts the essential information from the input image. For segmentation tasks, the essential information is the objects present in the image and their locations.
  • Decoder: The extracted information from the encoding phase is used to create an output with the size of the original input image.

Implementation

The latest implementation of DeepLab supports multiple network backbones, like MobileNet-v2, Xception, ResNet-v1, PNASNet and Auto-DeepLab.

To get the current DeepLab TensorFlow implementation, you have to clone the DeepLab directory from this GitHub project. It provides the code to train and evaluate the desired model.

In our case, we use the Xception network pre-trained on the Pascal VOC 2012 Semantic Segmentation benchmark. Since we have access to a local GPU server at Novatec, I decided to use TensorFlow GPU and Keras.

So, let’s get started with programming! All my examples are based on the code published by the authors of the paper. Our application will be a real-time segmentation of your webcam stream. To that end, make sure you have already cloned the DeepLab GitHub repository. First of all, we have to download the pre-trained model and save it into our models directory. The following code, an extension of the off-the-shelf Jupyter notebook by the paper’s authors, downloads the pre-trained model:
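A sketch of that download step (the helper name `download_pretrained_model` is mine; the tarball name comes from the DeepLab model zoo, so check model_zoo.md in the cloned repository for the current list):

```python
import os
import urllib.request

# Xception model pre-trained on PASCAL VOC 2012 trainval, as listed in
# the DeepLab model zoo (verify the name against model_zoo.md).
DOWNLOAD_URL_PREFIX = 'http://download.tensorflow.org/models/'
MODEL_TARBALL = 'deeplabv3_pascal_trainval_2018_01_04.tar.gz'


def download_pretrained_model(model_dir='models', tarball=MODEL_TARBALL):
    """Download the frozen model tarball into model_dir (skipped if it
    is already present) and return its local path."""
    os.makedirs(model_dir, exist_ok=True)
    tarball_path = os.path.join(model_dir, tarball)
    if not os.path.exists(tarball_path):
        urllib.request.urlretrieve(DOWNLOAD_URL_PREFIX + tarball,
                                   tarball_path)
    return tarball_path

# Usage: tarball_path = download_pretrained_model()
```

The existence check makes the notebook cell safe to re-run without downloading the roughly 400 MB tarball again.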

The implementation of the DeepLab model works as follows. We need a class that loads the saved model and runs inference on a single image:
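The class below is adapted from the authors' demo notebook (TensorFlow 1.x graph API); I factored the resize computation into a `target_size` helper for clarity, and the tensor names are the ones used by the exported frozen graphs.

```python
import os
import tarfile

import numpy as np


class DeepLabModel:
    """Loads a frozen DeepLab graph from the downloaded tarball and
    runs inference on single PIL images."""

    INPUT_TENSOR_NAME = 'ImageTensor:0'
    OUTPUT_TENSOR_NAME = 'SemanticPredictions:0'
    INPUT_SIZE = 513                 # exported models expect <= 513 px
    FROZEN_GRAPH_NAME = 'frozen_inference_graph'

    def __init__(self, tarball_path):
        import tensorflow as tf      # TF 1.x (or tf.compat.v1)

        # The tarball contains the frozen graph; find and parse it.
        graph_def = None
        with tarfile.open(tarball_path) as tar_file:
            for tar_info in tar_file.getmembers():
                if self.FROZEN_GRAPH_NAME in os.path.basename(tar_info.name):
                    graph_def = tf.GraphDef.FromString(
                        tar_file.extractfile(tar_info).read())
                    break
        if graph_def is None:
            raise RuntimeError('Cannot find inference graph in tar archive.')

        self.graph = tf.Graph()
        with self.graph.as_default():
            tf.import_graph_def(graph_def, name='')
        self.sess = tf.Session(graph=self.graph)

    @classmethod
    def target_size(cls, width, height):
        """Scale the longer edge down to INPUT_SIZE, keeping the aspect ratio."""
        ratio = cls.INPUT_SIZE / max(width, height)
        return int(ratio * width), int(ratio * height)

    def run(self, image):
        """Segment a PIL image; returns (resized_image, seg_map)."""
        from PIL import Image

        resized = image.convert('RGB').resize(
            self.target_size(*image.size), Image.LANCZOS)
        batch_seg_map = self.sess.run(
            self.OUTPUT_TENSOR_NAME,
            feed_dict={self.INPUT_TENSOR_NAME: [np.asarray(resized)]})
        return resized, batch_seg_map[0]   # seg_map: one class id per pixel
```

`seg_map` is a 2-D array of PASCAL class ids with the same height and width as the resized input.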

Great! Only a few lines of code and we have implemented a powerful class for DeepLab. The next step is to feed this class with test data to predict some labeled areas.

That means the objects in the webcam video stream are segmented and colored, provided their classes are part of the Pascal VOC 2012 dataset.

To achieve this, I connected the DeepLab model with an OpenCV webcam stream. This is easily done with the following code:
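A sketch of that loop: `create_pascal_label_colormap` is the color-map helper from the authors' demo notebook, while `segment_webcam` is my own wrapper around any model object that offers a `run(pil_image) -> (resized_image, seg_map)` method, such as the model class from the demo notebook.

```python
import numpy as np


def create_pascal_label_colormap():
    """Maps each PASCAL VOC class id to an RGB color (helper from the
    authors' demo notebook)."""
    colormap = np.zeros((256, 3), dtype=int)
    ind = np.arange(256, dtype=int)
    for shift in reversed(range(8)):
        for channel in range(3):
            colormap[:, channel] |= ((ind >> channel) & 1) << shift
        ind >>= 3
    return colormap


def segment_webcam(model):
    """Overlay DeepLab predictions on the live webcam stream.

    Press 'q' to quit the preview window."""
    import cv2                       # opencv-python
    from PIL import Image

    colormap = create_pascal_label_colormap()
    cap = cv2.VideoCapture(0)        # default webcam
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        # OpenCV delivers BGR frames; the model expects an RGB PIL image.
        resized, seg_map = model.run(
            Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
        seg_image = colormap[seg_map].astype(np.uint8)
        # Blend the resized camera image with the colored class map.
        overlay = cv2.addWeighted(
            cv2.cvtColor(np.asarray(resized), cv2.COLOR_RGB2BGR), 0.6,
            cv2.cvtColor(seg_image, cv2.COLOR_RGB2BGR), 0.4, 0)
        cv2.imshow('DeepLab segmentation', overlay)
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break
    cap.release()
    cv2.destroyAllWindows()
```

Blending the color map over the camera image (instead of showing the raw class map) makes it much easier to judge the segmentation boundaries live.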

Results

If you want to test the webcam application, copy the Jupyter notebook from my GitHub repository into your cloned DeepLab directory. When you run it, you see a real-time segmentation of your webcam stream. Naturally, the model only segments the objects it was trained on. The PASCAL dataset contains, for example, the classes background, person, bottle, bicycle, boat, chair, tv monitor and sofa, among many others.

When I ran the code, I got the following webcam result:

Webcam Segmentation with OpenCV

As you can see, the model segmented me as a person (class person: blue), the background (black) and my bottle (class bottle: purple). Unfortunately, the model also predicted the fire extinguisher in the background as a bottle, because the two have a similar shape.

A big advantage of DeepLab is its easy out-of-the-box usage: no long training is needed. Furthermore, it is very flexible with respect to the various pre-trained models and datasets that are available. In fact, I recommend using it for quick tests and feasibility analyses, because it provides fast results!

An overview of my code is available here. If there are any questions, I would be glad to hear from you.

Otherwise, I wish you lots of fun with segmenting some objects in front of your webcam!

References

“Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation”, Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, Hartwig Adam, Proc. ECCV 2018