Semantic Segmentation Part 1: DeepLab-V3+

Our agenda for the Semantic Segmentation series is as follows:
- Part 1: Using DeepLab-V3+
- Part 2: Training a U-Net
- Part 3: Transfer Learning with Mask R-CNN
- Part 4: State-of-the-Art (Summary)
Introduction
Semantic Segmentation is the task of assigning a semantic label to every pixel in an image or video. With the goal of real-time segmentation, I applied the open-source DeepLab model by Google, which is implemented in TensorFlow. Since its initial release, the model has been improved in the following stages:
- DeepLab-V1: Using atrous convolution to control the resolution of feature responses in CNNs. This is also known as dilated convolution; it introduces an additional parameter, the dilation rate, which spreads the sampled pixels over a wider field of view while keeping the same number of weights (see the sketch after this list).
- DeepLab-V2: Using atrous spatial pyramid pooling (ASPP), which helps to account for different object scales and improves accuracy.
- DeepLab-V3: Adding image-level features to ASPP and applying batch normalization for easier training.
- DeepLab-V3+: Extension of DeepLab-V3 with a decoder module to refine the segmentation results.
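
To make the dilation rate more concrete, here is a minimal sketch (my own illustration, not part of the DeepLab code, assuming TensorFlow with tf.keras): both layers below have exactly the same number of weights, but the atrous one covers a 5x5 field of view instead of 3x3.

```python
import tensorflow as tf

x = tf.random.normal((1, 65, 65, 256))  # dummy feature map

# Standard 3x3 convolution: 3x3 field of view
standard = tf.keras.layers.Conv2D(64, kernel_size=3, padding='same')
# Atrous 3x3 convolution with dilation rate 2: 5x5 field of view
atrous = tf.keras.layers.Conv2D(64, kernel_size=3, padding='same', dilation_rate=2)

print(standard(x).shape, atrous(x).shape)                # same output shape
print(standard.count_params() == atrous.count_params())  # True: same number of weights
```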

Architecture of DeepLab-V3+ (from the related paper)
The architecture of the latest version of DeepLab (DeepLab-V3+) consists of two steps:
- Encoder: In this step, a pre-trained CNN extracts the essential information from the input image. For segmentation tasks, the essential information is the objects present in the image and their locations.
- Decoder: The extracted information from the encoding phase is used to create an output with the size of the original input image.
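
As a rough sketch of these two steps (shapes only; the dummy tensors stand in for the real backbone and ASPP, and the TF 2-style resize API is an assumption of mine):

```python
import tensorflow as tf

image = tf.random.normal((1, 513, 513, 3))  # dummy input image

# Encoder: the backbone plus ASPP reduce the spatial resolution
# (output stride 16) while extracting semantic features.
encoder_features = tf.random.normal((1, 33, 33, 256))  # ceil(513 / 16) = 33

# Decoder: predict one of the 21 Pascal VOC classes per position,
# then upsample back to the input resolution.
logits = tf.keras.layers.Conv2D(21, kernel_size=1)(encoder_features)
upsampled = tf.image.resize(logits, (513, 513))  # bilinear upsampling
print(upsampled.shape)  # (1, 513, 513, 21)
```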
Implementation
The latest implementation of DeepLab supports multiple network backbones, such as MobileNet-v2, Xception, ResNet-v1, PNASNet, and Auto-DeepLab.
To get the current DeepLab TensorFlow implementation, you have to clone the DeepLab directory from this GitHub project. It provides the code to train and evaluate the desired model.
In our case, we use the Xception network pre-trained on the Pascal VOC 2012 semantic segmentation benchmark. Since at Novatec we have access to a local GPU server, I decided to use TensorFlow-GPU and Keras.
So, let’s get started with programming! All my examples are based on the code published by the authors of the paper. Our application will be a real-time segmentation of your webcam stream, so make sure you have already cloned the DeepLab GitHub directory. First of all, we have to download the pre-trained model and save it into our models directory. The following code, an extension of the off-the-shelf Jupyter notebook by the paper’s authors, downloads the pre-trained model:
```python
# Some imports
import collections
import os
import io
import sys
import tarfile
import tempfile
import urllib.request

from IPython import display
from ipywidgets import interact
from ipywidgets import interactive
from matplotlib import gridspec
from matplotlib import pyplot as plt
import numpy as np
from PIL import Image
import cv2
import tensorflow as tf

sys.path.append('utils')
import get_dataset_colormap

# Download URLs of the pre-trained Xception model
_MODEL_URLS = {
    'xception_coco_voctrainaug': 'http://download.tensorflow.org/models/deeplabv3_pascal_train_aug_2018_01_04.tar.gz',
    'xception_coco_voctrainval': 'http://download.tensorflow.org/models/deeplabv3_pascal_trainval_2018_01_04.tar.gz',
}

Config = collections.namedtuple('Config', 'model_url, model_dir')

def get_config(model_name, model_dir):
    return Config(_MODEL_URLS[model_name], model_dir)

# Interactive widget to choose the checkpoint and the target directory
config_widget = interactive(get_config, model_name=_MODEL_URLS.keys(), model_dir='')
display.display(config_widget)

_TARBALL_NAME = 'deeplab_model.tar.gz'
config = config_widget.result

# Create the model directory (a temporary one if none was given)
model_dir = config.model_dir or tempfile.mkdtemp()
tf.gfile.MakeDirs(model_dir)

download_path = os.path.join(model_dir, _TARBALL_NAME)
print('downloading model to %s, this might take a while...' % download_path)
urllib.request.urlretrieve(config.model_url, download_path)
print('download completed!')
```
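Note that this cell downloads the model again on every run. As an optional variation (my own addition, reusing the names from the cell above), you can point download_path at a fixed local directory and skip the download if the tarball already exists:

```python
# Optional: reuse an already downloaded tarball instead of the interactive
# widget, so the model is not fetched again on every run.
model_dir = 'models'  # local directory of your choice
tf.gfile.MakeDirs(model_dir)
download_path = os.path.join(model_dir, _TARBALL_NAME)
if not os.path.exists(download_path):
    urllib.request.urlretrieve(_MODEL_URLS['xception_coco_voctrainaug'], download_path)
```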
The implementation of the DeepLab model works as follows. We need a class that loads the saved model and runs inference on a single image:
```python
# Name of the frozen graph file inside the downloaded tarball
_FROZEN_GRAPH = 'frozen_inference_graph'

# Class to load the DeepLab model and run inference
class DeepLab(object):
    INPUT_TENSOR = 'ImageTensor:0'
    OUTPUT_TENSOR = 'SemanticPredictions:0'
    INPUT_SIZE = 513

    def __init__(self, tarball_path):
        """Creates and loads the pre-trained DeepLab model."""
        self.graph = tf.Graph()
        graph_def = None
        # Extract the frozen inference graph from the tar archive
        tar_file = tarfile.open(tarball_path)
        for tar_info in tar_file.getmembers():
            if _FROZEN_GRAPH in os.path.basename(tar_info.name):
                file_handle = tar_file.extractfile(tar_info)
                graph_def = tf.GraphDef.FromString(file_handle.read())
                break
        tar_file.close()
        if graph_def is None:
            raise RuntimeError('Cannot find inference graph in tar archive.')
        with self.graph.as_default():
            tf.import_graph_def(graph_def, name='')
        self.sess = tf.Session(graph=self.graph)

    def run(self, image):
        """Runs inference on a single image.

        Args:
            image: a PIL.Image object.
        Returns:
            resized_image: RGB image resized from the original input image.
            seg_map: segmentation map of the resized image.
        """
        width, height = image.size
        # Scale the longer side to INPUT_SIZE, keeping the aspect ratio
        resize_ratio = 1.0 * self.INPUT_SIZE / max(width, height)
        target_size = (int(resize_ratio * width), int(resize_ratio * height))
        resized_image = image.convert('RGB').resize(target_size, Image.ANTIALIAS)
        batch_seg_map = self.sess.run(
            self.OUTPUT_TENSOR,
            feed_dict={self.INPUT_TENSOR: [np.asarray(resized_image)]})
        seg_map = batch_seg_map[0]
        return resized_image, seg_map
```
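Before wiring up the webcam, you can sanity-check the class on a single image. This is my own snippet, and 'test.jpg' is just a placeholder file name:

```python
# Quick test on a single image (replace 'test.jpg' with a real file)
model = DeepLab(download_path)
image = Image.open('test.jpg')
resized_im, seg_map = model.run(image)
print(resized_im.size)     # e.g. (513, 384)
print(np.unique(seg_map))  # Pascal VOC class ids present in the image
```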
Great! With only a few lines of code, we have implemented a powerful wrapper class for DeepLab. The next step is to feed this class with test data to predict some labeled areas.
That means the objects in the webcam video stream get segmented and colored if their classes are contained in the Pascal VOC 2012 dataset.
To achieve this, I connected the DeepLab model to an OpenCV webcam stream, which is easily done with the following code:
```python
# Note: download_path points to the tarball downloaded above, so every time
# you run the notebook from scratch, a new model will be downloaded.
# Change this to a local path to avoid that!
model = DeepLab(download_path)

cap = cv2.VideoCapture(0)

while True:
    ret, frame = cap.read()
    # OpenCV delivers BGR frames; convert to RGB for PIL
    cv2_im = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    pil_im = Image.fromarray(cv2_im)

    # Run the model on the current frame
    resized_im, seg_map = model.run(pil_im)

    # Color the predicted labels with the Pascal VOC colormap
    seg_image = get_dataset_colormap.label_to_color_image(
        seg_map, get_dataset_colormap.get_pascal_name()).astype(np.uint8)

    # Resize the camera frame to the size of the segmentation map
    frame = np.array(pil_im)
    r = seg_image.shape[1] / frame.shape[1]
    dim = (int(frame.shape[0] * r), seg_image.shape[1])[::-1]
    resized = cv2.resize(frame, dim, interpolation=cv2.INTER_AREA)
    resized = cv2.cvtColor(resized, cv2.COLOR_RGB2BGR)

    # Show the camera frame and the segmentation side by side
    color_and_mask = np.hstack((resized, seg_image))
    cv2.imshow('frame', color_and_mask)

    if cv2.waitKey(25) & 0xFF == ord('q'):
        cap.release()
        cv2.destroyAllWindows()
        break
```
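If you prefer an overlay to the side-by-side view, a small variation of my own (not part of the original code) blends the color mask over the camera frame; drop these lines into the loop in place of the np.hstack display:

```python
# Blend the color mask over the camera frame (weights chosen arbitrarily)
overlay = cv2.addWeighted(resized, 0.6,
                          cv2.cvtColor(seg_image, cv2.COLOR_RGB2BGR), 0.4, 0)
cv2.imshow('frame', overlay)
```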
Results
If you want to test the webcam application, copy the Jupyter notebook from my GitHub repository into your cloned DeepLab directory. When you run it, you can see a real-time segmentation of your webcam stream. Logically, the model only segments the objects it was trained on. In this case the PASCAL dataset contains, for example, the classes background, person, bottle, bicycle, boat, chair, tv monitor, sofa and many more (the full list is shown below).
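For reference, these are the 21 Pascal VOC classes; the index of each name corresponds to the class id in the segmentation map:

```python
LABEL_NAMES = [
    'background', 'aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus',
    'car', 'cat', 'chair', 'cow', 'diningtable', 'dog', 'horse', 'motorbike',
    'person', 'pottedplant', 'sheep', 'sofa', 'train', 'tvmonitor'
]
# Map the class ids in a segmentation map to readable names:
print([LABEL_NAMES[i] for i in np.unique(seg_map)])
```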
When I ran the code, I got the following webcam result:

Webcam Segmentation with OpenCV
As you can see, the model segmented me as a person (class person: blue), the background (black), and my bottle (class bottle: purple). Unfortunately, the model also predicted the fire extinguisher in the background to be a bottle, because it has a similar shape to a bottle.
A big advantage of DeepLab is its easy out-of-the-box usage, so no lengthy training is needed. Furthermore, it is very flexible with respect to the various pre-trained models and datasets that are available. In fact, I recommend using it as a test model and for feasibility analyses, because it provides fast results!
An overview of my code is available here. If there are any questions, I would be glad to hear from you.
Otherwise, I wish you lots of fun with segmenting some objects in front of your webcam!
References
Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, Hartwig Adam: "Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation", Proc. ECCV 2018.