Kevin Widholm
13 September 2019
4 min read

Semantic Segmentation Part 2: Training U-Net

At Novatec, we like to look at what lies beneath the tip of the iceberg, so we were interested in reviewing different approaches to Semantic Segmentation. In my last post, I gave an insight into solving a segmentation problem using the pre-trained model DeepLab-V3+. Do you also think that it is now time to train a model from scratch?

So, after the out-of-the-box solution in the blog post Semantic Segmentation Part 1: DeepLab-V3, this post is about training a model from scratch!

The post is organized as follows: I first give a short introduction to the U-Net architecture, then give an overview of the example application, and finally present my implementation.


Many use cases in the field of Semantic Segmentation require classification at pixel level, which can only be achieved by training the model individually. Nowadays, customers expect to be able to label thousands of objects in an image or video. It is therefore important for the model to understand the context of the environment in which it operates. In such a case, the model and its training data have to be tailored to the task from the very beginning. Given these requirements, it is more efficient to train a model from scratch.

U-Net comes into play for complex image segmentation problems. Similar to the architecture of DeepLab, it consists of an encoder and a decoder. More specifically, U-Net is a further development of Convolutional Neural Networks (CNNs), which learn feature maps of an image and condense them into increasingly abstract representations. On its own, however, the CNN approach doesn’t work well for image segmentation, where we also need to reconstruct an image from a vector. The decoder of U-Net is responsible for this reconstruction task. During the encoder phase, the feature maps have already been learned. The decoder reuses the same feature maps that were used for contraction to expand the vector back into a segmented image. Without going further in depth: drawn as a diagram, the structure of the U-Net model looks like a “U”, which is where its name comes from.

U-Net architecture (related paper)
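To make the contracting and expanding paths more concrete, here is a minimal sketch that only traces feature-map sizes through a symmetric U-Net. The depth of 4, the 64 base filters and the "same"-padded convolutions are assumptions chosen for illustration; they are not taken from my implementation.

```python
def unet_shapes(size=1024, base_filters=64, depth=4):
    """Trace (stage, height/width, channels) through a symmetric U-Net.

    Assumes 'same' padding, 2x2 pooling in the encoder and 2x upsampling
    in the decoder. All default values are illustrative.
    """
    shapes = []
    skips = []
    filters = base_filters

    # Encoder: remember each stage for the skip connection, then downsample.
    for level in range(depth):
        shapes.append((f"enc{level}", size, filters))
        skips.append((size, filters))
        size //= 2
        filters *= 2

    shapes.append(("bottleneck", size, filters))

    # Decoder: upsample, concatenate the matching encoder features, convolve.
    for level in reversed(range(depth)):
        size *= 2
        filters //= 2
        skip_size, skip_filters = skips.pop()
        assert skip_size == size  # skip connections join equal resolutions
        shapes.append((f"up{level}+skip", size, filters + skip_filters))
        shapes.append((f"dec{level}", size, filters))

    return shapes


for stage, side, channels in unet_shapes():
    print(f"{stage:12s} {side:5d} x {side:<5d} channels={channels}")
```

The printed table goes down to the bottleneck and back up to the input resolution, which is exactly the "U" in the diagram.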

Example Application

To test the U-Net model in practice, I looked for a challenging application that requires a lot of computing power. I chose a Kaggle challenge: the Carvana Image Masking Challenge.

Behind this challenge stands Carvana, an online used car startup. Their innovation is a custom rotating photo studio that automatically captures and processes standard images of each vehicle in their inventory. The challenge was to develop a machine learning algorithm that automatically removes the photo studio background, which is a clear case for Semantic Segmentation! This allows Carvana to superimpose cars on different backgrounds. The approach is also applicable to many other businesses in the sales industry, for example for shoppers who want to know everything about a product, or for sellers who want to present their products in the most attractive way.
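Once a mask is available, superimposing a car on a new background reduces to a per-pixel blend. The following is a minimal NumPy sketch of that idea; the function name `composite` and the tiny 2x2 arrays are hypothetical and only serve as an illustration.

```python
import numpy as np


def composite(car_image, mask, background):
    """Place the masked car pixels on top of a new background.

    car_image, background: float arrays of shape (H, W, 3)
    mask: array of shape (H, W), 1 for car pixels and 0 for studio background
    """
    alpha = mask[..., np.newaxis].astype(np.float64)  # broadcast over channels
    return alpha * car_image + (1.0 - alpha) * background


# Tiny 2x2 example: a white "car" on the diagonal, black new background.
car = np.ones((2, 2, 3))
new_background = np.zeros((2, 2, 3))
mask = np.array([[1, 0],
                 [0, 1]])
result = composite(car, mask, new_background)
print(result[..., 0])
```

Because the blend is linear, the same function also works with soft masks whose values lie between 0 and 1.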

The objective of this competition was to create a model for binary segmentation of car images. In other words, we have to identify the boundaries of a car in an image in order to cut it out, classifying each pixel as either car or background. The participants of the challenge have access to the Carvana dataset, which consists of one training set and one testing set.

Training examples: car images and their masks

  • Training set: 5088 images: 16 angles from 318 vehicles, 5088 masks (404 MB)
  • Testing set: 100064 images: 16 angles from 6254 vehicles (7.76 GB)
  • All images are 1280 x 1918 pixels
  • Metadata for each car: Year, Make, Model, Trim

Perhaps you have already noticed that the training set is really small in comparison to the testing set, so the training images alone are not enough. The training masks were created by humans and, as in most human-labelled datasets, they are not fully consistent.


Let’s dive into programming! You can find the whole code in the linked GitHub repository. I deployed my application on our Novatec GPU server using TensorFlow-GPU and Keras. The evaluation metric in this competition is the Dice coefficient, a value between 0 and 1, where 1 is the best value.
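For two masks A and B, the Dice coefficient is 2·|A ∩ B| / (|A| + |B|). Here is a minimal NumPy sketch of that formula. The `smooth` term is an assumption on my side: it is a common trick to avoid division by zero for two empty masks and is not part of the definition itself.

```python
import numpy as np


def dice_coefficient(y_true, y_pred, smooth=1.0):
    """Dice coefficient between two masks with values in [0, 1].

    smooth is a hypothetical stabilising term; set it to 0.0 to get
    the plain definition 2*|A and B| / (|A| + |B|).
    """
    y_true = np.asarray(y_true, dtype=np.float64).ravel()
    y_pred = np.asarray(y_pred, dtype=np.float64).ravel()
    intersection = np.sum(y_true * y_pred)
    return (2.0 * intersection + smooth) / (y_true.sum() + y_pred.sum() + smooth)


ground_truth = [1, 1, 0, 0]
prediction = [1, 0, 0, 0]
print(dice_coefficient(ground_truth, ground_truth))           # perfect overlap
print(dice_coefficient(ground_truth, prediction, smooth=0.0))  # partial overlap
```

A perfect prediction gives 1, completely disjoint masks give 0, and the partial overlap above gives 2·1 / (2 + 1) = 2/3.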

Since our net will be too deep to be trained on a CPU, we first have to check that our GPU is being used correctly.
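One way to run such a check is sketched below. It assumes TensorFlow 2.1 or newer, where `tf.config.list_physical_devices` is available; older versions expose this differently. The helper name `available_gpus` is hypothetical, and the sketch is not necessarily the check used in my repository.

```python
def available_gpus():
    """Return the names of GPUs visible to TensorFlow.

    Returns None if TensorFlow is not installed at all.
    Assumes TensorFlow >= 2.1 for tf.config.list_physical_devices.
    """
    try:
        import tensorflow as tf
    except ImportError:
        return None
    return [device.name for device in tf.config.list_physical_devices("GPU")]


gpus = available_gpus()
if gpus is None:
    print("TensorFlow is not installed")
elif gpus:
    print("Training can use:", gpus)
else:
    print("No GPU visible, training would fall back to the CPU")
```

If the list is empty although a GPU is installed, the driver or CUDA setup is the usual place to start looking.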

As the training dataset is much smaller than the test dataset, one of the most effective improvements is to augment it by applying some transformations. Haven’t heard of Image Augmentation yet? Here’s an interesting blog post by one of my colleagues: Image classification with CNNs.

So I applied random hue and saturation shifts, random shifting, scaling and rotation, and random horizontal flips. With this approach, the model generalizes better to the testing data. As a result, the training dataset has been extended by the following adjustments:

Impact of random hue saturation

Impact of random shift scale rotate
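The important detail with augmentation for segmentation is that geometric transformations must be applied to the image and its mask together, while colour transformations only touch the image. The following NumPy sketch illustrates this with a horizontal flip and a simplified saturation jitter. The function names, the probability `p` and the `limit` are hypothetical, and the saturation change is a plain blend with the grey value rather than a true HSV conversion.

```python
import numpy as np


def random_horizontal_flip(image, mask, rng, p=0.5):
    """Flip image and mask together along the width axis with probability p."""
    if rng.random() < p:
        image = image[:, ::-1]
        mask = mask[:, ::-1]
    return image, mask


def random_saturation(image, rng, limit=0.2):
    """Scale the colour saturation by a random factor in [1 - limit, 1 + limit].

    image: float array of shape (H, W, 3) with values in [0, 1].
    The mask is left untouched because the car outline does not change.
    """
    factor = 1.0 + rng.uniform(-limit, limit)
    gray = image.mean(axis=2, keepdims=True)
    return np.clip(gray + (image - gray) * factor, 0.0, 1.0)


rng = np.random.default_rng(seed=42)
image = rng.random((4, 6, 3))            # stand-in for a car photo
mask = np.zeros((4, 6), dtype=np.uint8)  # stand-in for its mask
mask[:, :2] = 1

aug_image, aug_mask = random_horizontal_flip(image, mask, rng)
aug_image = random_saturation(aug_image, rng)
print(aug_image.shape, aug_mask.shape)
```

Applying these functions on the fly in the data generator means the model rarely sees exactly the same image twice, without storing any additional files.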

The implementation of the U-Net architecture works as follows. The network takes a 1024x1024x3 image as input and outputs a mask of the same height and width, which is what the model should learn. A sigmoid activation function on the output makes sure that the mask pixels are between 0 and 1.
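To illustrate the role of the sigmoid, here is a small NumPy sketch that turns raw network outputs into per-pixel probabilities and then into a binary mask. The threshold of 0.5 and the function name `logits_to_mask` are assumptions for illustration only.

```python
import numpy as np


def logits_to_mask(logits, threshold=0.5):
    """Map raw outputs to probabilities in (0, 1) and to a binary mask."""
    probabilities = 1.0 / (1.0 + np.exp(-logits))        # sigmoid
    binary_mask = (probabilities > threshold).astype(np.uint8)
    return probabilities, binary_mask


logits = np.array([[-4.0, 0.0],
                   [4.0, 10.0]])
probabilities, binary_mask = logits_to_mask(logits)
print(probabilities)
print(binary_mask)
```

Strongly negative outputs end up close to 0 (background) and strongly positive ones close to 1 (car); only the final thresholding makes the mask strictly binary.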

If you are interested in the whole source code, you will find it in my GitHub repository: here.


After training the model for 30 epochs on our GPU server (which took 2 days), it is able to predict masks almost identical to the original ones. The biggest input image that our GPU can handle is 1024x1024x3 with a batch size of 1. With this result I achieved a Dice score of 0.9953, only 0.002 below the winning team of this competition (Dice score of 0.9973).

My predicted masks with test images

That’s it! I hope you enjoyed this blog post. There is an evaluation of all the models in the last post of this Semantic Segmentation series.

If you have any thoughts, questions or suggestions, please share them with me.