Kevin Widholm
September 26, 2019
6 min read

Semantic Segmentation Part 3: Transfer Learning with Mask R-CNN

It has been nearly a decade since Deep Learning became feasible and integral to many widely used software applications. We can now transfer and leverage the knowledge gained from models trained in the past! I'm talking about Transfer Learning. In this post, we will go one step further by reusing a model developed for another task as the starting point for our Semantic Segmentation task.

In this third post of the Semantic Segmentation series, we will dive into one of the more recent models in this area: Mask R-CNN. In contrast to the last two posts, Part 1: DeepLab-V3 and Part 2: U-Net, I neither used an out-of-the-box solution nor trained a model from scratch. This time it is the turn of Transfer Learning!

If you have not heard about Transfer Learning yet, you should read this blog post first: click here!

The post is organized as follows: first I explain the Mask R-CNN architecture in an introduction, then give an overview of the example application, and finally present my implementation.


Introduction to Mask R-CNN

Mask R-CNN combines elements of object detection and Semantic Segmentation. The goal of object detection is to classify objects and localize them with bounding boxes, while in Semantic Segmentation we predict a class for each pixel. The combination of both is so-called instance segmentation. Consequently, Mask R-CNN takes a different approach than the already familiar encoder-decoder structure of the previous models (DeepLab and U-Net).

As an extension of Faster R-CNN, which is used for object detection, Mask R-CNN adds a branch that predicts a segmentation mask for each Region of Interest (Figure 2). Besides the class label and bounding box coordinates, it returns a mask for each detected object.

To better understand how Mask R-CNN works, I will briefly describe the idea behind Faster R-CNN (Figure 1), which consists of the following elements:

  • Basic ConvNet: Feature map extraction from the images.
  • Region Proposal Network (RPN): Transformation of the input feature maps into candidate bounding boxes.
  • Region of Interest (RoI) pooling layer: Conversion of all candidates to the same size.
  • Fully connected layers: Classification and bounding box regression.

Figure 1: Faster R-CNN (related paper)

The main difference from Faster R-CNN is an object mask extension at the end of the process. The following elements are changed in Mask R-CNN:

  • Backbone Model: Instead of a plain ConvNet, a ResNet-101 architecture is used to extract features.
  • Region Proposal Network (RPN): Prediction of regions containing an object.
  • Region of Interest (RoI): In addition to converting the regions to the same shape and predicting bounding boxes, Mask R-CNN generates an object mask.
  • Segmentation Mask: A mask branch is added to the existing architecture; for the regions selected via their Intersection over Union (IoU) with the ground-truth boxes, it returns a mask per region.

Figure 2: Mask R-CNN framework for instance segmentation (related paper)

Instead of the RoI pooling layer, the image is passed through RoIAlign, so that the regions of the feature map correspond better to the regions of the input image. This yields a more fine-grained alignment, which is necessary for pixel-level segmentation.

Example Application

To illustrate the practical use of Mask R-CNN in combination with Transfer Learning, I found an interesting application for instance segmentation. It’s the Kaggle Challenge: RSNA Pneumonia Detection Challenge.

The organization behind this challenge is the Radiological Society of North America (RSNA®), an international society of radiologists, medical physicists and other medical professionals. They see the potential of Machine Learning to solve the problem of detecting pneumonia in a simpler, automated way. Diagnosing pneumonia commonly requires the review of a chest radiograph (CXR) by qualified experts. The disease often manifests as an area of increased opacity on the CXR, but many other factors play a role as well.

So in this competition, the participants are asked to support clinical institutions by building a Machine Learning algorithm that locates lung opacities on chest radiographs. Our model will be applied to segment the relevant regions in the medical images and predict the probability of pneumonia. In other words, the objective is to detect and draw a bounding box around each pneumonia opacity. Each image can have zero or more opacities. The provided datasets contain the training set, which is already labeled, and the test set, for which predictions have to be submitted in the format:

confidence x-min y-min width height.


Implementation

Let’s start programming! You can find the whole code in the linked GitHub repository. I will briefly describe the steps of a possible solution. The main goal of this implementation is, on the one hand, to demonstrate Transfer Learning with Mask R-CNN in use and, on the other hand, to get some prediction results for pneumonia. Hence, I refrained from calculating an evaluation score, which in this competition would be the mean average precision at different Intersection over Union thresholds.

1. Import the requirements and check GPU usage:
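A minimal version of this setup cell might look as follows; it is a sketch, and the exact imports in the repository may differ:

```python
import os
import sys
import random

import numpy as np
import pandas as pd

# Check whether a GPU is visible -- training Mask R-CNN on a CPU is impractical.
try:
    import tensorflow as tf
    gpu_name = tf.test.gpu_device_name()  # empty string when no GPU is found
except ImportError:
    gpu_name = ''

print('GPU device:', gpu_name or 'none found')
```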

2. Download the dataset and Mask R-CNN repository:

The competition had two stages. We will only focus on stage two. (Why two stages?)

Train images: “”

Test images: “”

Training data: “stage_2_train_labels.csv”

Sample submission file: “stage_2_sample_submission.csv”

A file with detailed information about the positive and negative classes on the training set: “stage_2_detailed_class_info.csv”

Put the downloaded files into a directory and clone the following GitHub repository of Mask R-CNN:
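Assuming the Kaggle CLI is installed and configured with an API token, this step could look like the following (the target directory names are illustrative):

```shell
# Download the stage-2 competition data (requires accepting the competition
# rules on Kaggle and a configured API token).
kaggle competitions download -c rsna-pneumonia-detection-challenge
unzip -q rsna-pneumonia-detection-challenge.zip -d data

# Clone the Matterport Mask R-CNN implementation used in this post.
git clone https://github.com/matterport/Mask_RCNN.git
```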

3. Import Mask R-CNN and get COCO weights:

We use the pre-trained COCO weights for our Mask R-CNN model to realize Transfer Learning. These weights will be the baseline for the further training process!

4. Choose parameters:

The following parameters are not optimal; they were chosen to minimize the running time for demonstration purposes. To optimize them, we could apply hyperparameter tuning, for example with Grid Search or Random Search.
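As a sketch, a reduced configuration could look like this; the values below are illustrative, not the ones used on the GPU server:

```python
from mrcnn.config import Config


class DetectorConfig(Config):
    """Reduced Mask R-CNN configuration for the pneumonia dataset.

    Chosen for a short running time, not for accuracy.
    """
    NAME = 'pneumonia'
    GPU_COUNT = 1
    IMAGES_PER_GPU = 8
    NUM_CLASSES = 2                 # background + pneumonia opacity
    BACKBONE = 'resnet50'           # smaller than the ResNet-101 default
    IMAGE_MIN_DIM = 256
    IMAGE_MAX_DIM = 256
    TRAIN_ROIS_PER_IMAGE = 32
    STEPS_PER_EPOCH = 100
    DETECTION_MIN_CONFIDENCE = 0.78


config = DetectorConfig()
```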

5. Display a random image with bounding box:

Figure 3: Random image with bounding box

6. Train the model:

Now it’s time to train the model. Note that training even on a GPU can take a few hours or days. Therefore I limited the training to a few epochs!


After training the model on our GPU server with the declared parameters and different training schedules, for example a decreasing learning rate per epoch, I got the following results:

As you can see, the model has predicted some lung opacities as pneumonia and put bounding boxes around them. Some images have no bounding box, others have multiple.

With the use of Transfer Learning, the training time of the model was reduced considerably. In this way, we were able to achieve results quickly. Mask R-CNN in combination with Transfer Learning offers a valid alternative for Semantic Segmentation tasks. In my next post, I will compare this approach with the previous models, U-Net and DeepLab.

Until then, I wish you a lot of fun implementing your own Semantic Segmentation application.

Which is your favorite model so far?

If you enjoyed this blog post and haven’t read the others, you might also enjoy:

Comments


  1. José Miguel

    could you share the augmentations used?

    • Harald Bosch

      Dear José,

      yes, you can find the whole code in the GitHub repository linked in the section "Implementation". But in short, it is a combination of
      simple image adjustments (scaling, translating, rotating, shearing, brightening, contrast normalizing, blurring and sharpening).