This post is the conclusion of my blogpost series. It presents the state of the art of Semantic Segmentation by giving a short summary of each approach.
Moreover, we will contrast the approaches from the previous parts. If you have not read them yet, you should do so; you will find the blogposts under the following links:
Semantic Segmentation algorithms have solved several computer vision tasks of increasing difficulty. In this way, many applications such as autonomous cars or facial recognition systems became possible.
Semantic Segmentation means more than assigning a semantic label to the whole image, as in classification tasks. It is a more advanced technique that requires outlining the objects and partitioning the image into multiple segments. In fact, the problem of Semantic Segmentation is to find an irregular shape that overlaps with the real shape of the detected object.
There are a few existing approaches to Semantic Segmentation, such as out-of-the-box solutions, training models from scratch and Transfer Learning. One thing most of the architectures have in common is an encoder network followed by a decoder network:
- Standard pre-trained classification network (VGG, ResNet,…) as encoder
- Projection of the discriminative features onto the pixel space in order to get a dense classification as decoder
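To make the encoder-decoder idea concrete, here is a toy numpy sketch of the shape flow only (an illustration, not a real network): the encoder shrinks the spatial resolution, and the decoder restores it to the input size.

```python
import numpy as np

def max_pool2x2(x):
    # encoder stand-in: downsample by taking the max over non-overlapping 2x2 windows
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def upsample2x2(x):
    # decoder stand-in: nearest-neighbour upsampling, repeating each pixel 2x2
    return x.repeat(2, axis=0).repeat(2, axis=1)

img = np.arange(16.0).reshape(4, 4)   # toy 4x4 "image"
encoded = max_pool2x2(img)            # encoder halves the resolution -> 2x2
decoded = upsample2x2(encoded)        # decoder restores the input size -> 4x4
assert decoded.shape == img.shape
```

In a real network the pooling is interleaved with learned convolutions and the upsampling is a learned transposed convolution, but the resolution bookkeeping is exactly this.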
Fully Convolutional Network (FCN)
Fully Convolutional Networks (FCNs) owe their name to their architecture: they contain only locally connected layers and no dense layers. The input image is reduced to a smaller size using convolutions and max-pooling layers. After the class labels are predicted, upsampling and deconvolution layers resize the output back to the original size, so the output image has the same size as the input image. By merging features from various resolution levels, the network recovers fine-grained spatial information lost during downsampling.
Unfortunately, I have no example for the FCN model, because FCN is not as powerful as the other discussed models and serves here as background information.
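Although the series contains no full FCN implementation, the core trick mentioned above, merging an upsampled coarse prediction with a finer feature map from an earlier layer, can still be sketched in a few lines of numpy (a toy illustration under simplified assumptions, not the real FCN code):

```python
import numpy as np

def upsample(x, factor):
    # nearest-neighbour upsampling along both spatial axes
    # (a stand-in for the learned deconvolution layers in a real FCN)
    return x.repeat(factor, axis=0).repeat(factor, axis=1)

# coarse scores from the deepest layer and finer features from a shallower layer
coarse = np.array([[1.0, 2.0],
                   [3.0, 4.0]])
fine = np.ones((4, 4))

# FCN-style skip fusion: upsample the coarse map and add the finer one
fused = upsample(coarse, 2) + fine
assert fused.shape == (4, 4)
```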
The architecture of the latest version of DeepLab (DeepLab-V3) is composed of two steps. In the encoder step, a pre-trained CNN extracts the essential information from the input image. For segmentation tasks, the essential information is the objects present in the image and their locations. In the decoder step, the extracted information from the encoding phase is used to create an output with the size of the original input image.
For further information, please read the DeepLab blogpost.
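One key building block of DeepLab-V3's encoder is atrous (dilated) convolution, which enlarges the receptive field without shrinking the feature map. A minimal 1-D numpy sketch of the idea (illustrative only, not the DeepLab implementation):

```python
import numpy as np

def dilated_conv1d(signal, kernel, dilation):
    # Atrous convolution: kernel taps are spaced `dilation` samples apart,
    # so the receptive field grows without adding weights or downsampling.
    span = dilation * (len(kernel) - 1)
    out = np.zeros(len(signal) - span)
    for i in range(len(out)):
        out[i] = np.dot(signal[i:i + span + 1:dilation], kernel)
    return out

x = np.arange(8.0)
k = np.ones(3)
dense = dilated_conv1d(x, k, dilation=1)   # receptive field: 3 samples
atrous = dilated_conv1d(x, k, dilation=2)  # receptive field: 5 samples, same 3 weights
```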
As an extension of the FCN, the U-Net architecture is built to yield better segmentation. The architecture can be separated into a downsampling and an upsampling part. The first part is the contracting path, which captures the context in the image. The second part is the symmetric expanding path, which enables precise localization using transposed convolutions.
For detailed information, please read the U-Net blogpost.
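What distinguishes U-Net from a plain encoder-decoder are its skip connections: features saved on the contracting path are concatenated channel-wise with the upsampled features on the expanding path. A toy numpy sketch of that merge step (shapes only, not the real network):

```python
import numpy as np

# toy feature maps laid out as (channels, height, width)
encoder_feat = np.ones((8, 4, 4))   # saved during the contracting path
decoder_feat = np.zeros((8, 2, 2))  # coming up the expanding path

# upsample the decoder features back to the encoder resolution
# (a stand-in for the transposed convolution in the real U-Net)
upsampled = decoder_feat.repeat(2, axis=1).repeat(2, axis=2)

# U-Net skip connection: concatenate along the channel axis
merged = np.concatenate([encoder_feat, upsampled], axis=0)
assert merged.shape == (16, 4, 4)
```

The concatenation is why the expanding path has access to the fine-grained spatial detail that pooling discarded.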
Mask R-CNN takes a different approach than the encoder-decoder structure. It is an extension of Faster R-CNN, which is used for object detection. Mask R-CNN adds a branch for predicting a segmentation mask on each Region of Interest. Besides the class label and bounding box coordinates, it returns a mask for each detected object.
Mask R-CNN has two stages: a Region Proposal Network (RPN) and a binary mask classifier. The first stage proposes the locations in the image that most likely contain an object. The second stage predicts the class of the proposed object and generates a pixel-level mask for it.
For detailed information, please read the Mask R-CNN blogpost.
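The two-stage flow can be mimicked with a toy numpy sketch: a score map stands in for the RPN's proposals, and thresholding inside each proposed box stands in for the per-RoI mask head (the real model regresses anchor boxes and runs a small FCN per RoI; everything below is a simplified illustration).

```python
import numpy as np

# Stage 1 stand-in: an "objectness" score map; the high-scoring region
# plays the role of a proposal produced by the RPN.
scores = np.zeros((8, 8))
scores[2:6, 2:6] = 0.9
proposals = [(2, 2, 6, 6)]  # (y0, x0, y1, x1)

# Stage 2 stand-in: for each proposal, predict a binary mask inside the box.
masks = []
for (y0, x0, y1, x1) in proposals:
    roi = scores[y0:y1, x0:x1]
    masks.append((roi > 0.5).astype(np.uint8))

assert masks[0].shape == (4, 4)   # one binary mask per Region of Interest
```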
| | DeepLab-V3 | U-Net | Mask R-CNN |
|---|---|---|---|
| Related paper | DeepLab paper | U-Net paper | Mask R-CNN paper |
| Network structure | Encoder-decoder on top of a CNN backbone | Symmetric downsampling-upsampling paths | Two-stage architecture |
| Available frameworks | TensorFlow, PyTorch, Keras, MXNet, … | TensorFlow only; Keras with TensorFlow or Theano backend; black-box TensorFlow model; Theano; MXNet; Caffe; … | TensorFlow, PyTorch, Keras, Caffe, MXNet, Theano, … |
| Used library | Keras with TensorFlow backend | TensorFlow GPU and Keras | TensorFlow GPU and Keras |
| Transfer Learning | Google provides a convenient interface to use pre-trained models and to retrain using Transfer Learning. | Due to the flexibility of encoder-decoder networks, the encoder can be replaced with your favorite pre-trained network. | One of the advantages of Mask R-CNN is that a pre-trained model can easily be turned into a bespoke solution for your specific problem. |
| My used datasets | Pascal VOC 2012 | Carvana dataset | RSNA Pneumonia Detection Challenge dataset |
| Challenges | The main challenge of the DeepLab implementation is its out-of-the-box character, which makes further code adaptations more difficult. Sometimes there are errors in the assigned classes, but this can be improved by training the model on a specific dataset. | Because of its many layers, the U-Net model takes a considerable amount of time to train. Furthermore, the up- and downsampling parameters must be chosen carefully. Another challenge is that pre-trained U-Net models are not widely available, since the use cases are often too task-specific. | Mask R-CNN's implementation is harder, since it employs a two-stage learning approach: first the Region Proposal Network is optimized, then bounding boxes, classes and masks are predicted simultaneously. Mask R-CNN is computationally expensive, and real-time applications are very difficult. |
| Advantages | The DeepLab model combines several powerful Deep Learning concepts, which are constantly improved by Google. DeepLab is accurate and fast, which makes it well suited for real-time Semantic Segmentation. Due to the easy implementation and the convenient interfaces for retraining with Transfer Learning, this model is a good choice for solving Semantic Segmentation problems. | The U-Net model is currently the most used model, probably because of its easily understandable architecture. Even a small amount of data can achieve good results; in use cases where the number of annotated samples is limited, massive data augmentation is possible. Moreover, U-Net has no dense layers, so images of different sizes can be used as input. U-Net is also faster to run than Mask R-CNN. | The major advantage of Mask R-CNN is the easy use of Transfer Learning: a pre-trained model can quickly be turned into a bespoke solution for a specific problem. In addition, multi-GPU training is available and the model can be trained end-to-end. |
| Applications | Particularly useful for high-performance and real-time applications. | Very suitable for biomedical applications with little data. | Use of Transfer Learning with Mask R-CNN when pre-trained weights for the specific task already exist. |
DeepLab-V3 is particularly suitable for feasibility checks, because it is easy to implement and provides fast results. As my example implementation shows, DeepLab also works well for real-time applications.
U-Net is the most popular Semantic Segmentation model. U-Net is advisable for use cases with little data, such as biomedical problems. Furthermore, I recommend using it as a production model, because it offers flexible customization options.
Last but not least, I recommend Mask R-CNN as an alternative to U-Net, especially when Transfer Learning makes sense. Transfer Learning is reasonable if pre-trained weights already exist for the specific use case. Of course, it is also possible to apply Transfer Learning with DeepLab or U-Net.
Please note that this is only my assessment and recommendation based on my practical experience. Certainly there are even more possibilities for tackling the Semantic Segmentation problem beyond the solutions presented in my blogpost series!
If you enjoyed my blogpost series, share it with your friends and colleagues!