Kevin Widholm
22. June 2018

Landmark Recognition Challenge - Development and Experiences

In this post, I want to share my experiences from Kaggle's Landmark Recognition Challenge. Primarily, this post gives you an overview of how to develop a Machine Learning model that recognizes landmarks in images out of 15k classes. It is also for those who plan on participating in Kaggle Competitions like Landmark Recognition in the future. I will summarize my lessons learned at the end of this article. You can find the code in this repository.

The Landmark Recognition Challenge

So far, image classification challenges (for example the ImageNet Large Scale Visual Recognition Challenge) were kept comparatively simple, with a small number of classes and a lot of training examples per class. Landmark Recognition sets new standards in image classification with the largest worldwide dataset to date and challenges the participants to build a model that recognizes the correct landmark out of 15k possibilities in a test image dataset. In addition, several classes contain only one or a few training images. That alone makes it more difficult than older challenges. But there are two further snags, which take the Landmark Recognition Challenge to the next level:

  • There are test images with no landmark.
  • There are test images with more than one landmark.

Furthermore, it is better not to predict a landmark for a given image at all if the prediction score is too low. I wrote my classification results into a submission CSV file. This file needs a header row, followed by one row per test image with the predicted landmark and its prediction score.
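As a rough sketch of writing such a file: the code below assumes that the header is ‘id,landmarks’ and that each row holds the image id, followed by a landmark id and a score separated by a space (the field stays empty if no landmark is predicted). The function names and the sample ids are made up for illustration.

```python
import csv

def format_row(image_id, landmark_id=None, score=None):
    """Return one submission row; the landmarks field stays empty for 'no landmark'."""
    if landmark_id is None:
        return [image_id, ""]
    return [image_id, f"{landmark_id} {score:.4f}"]

def write_submission(path, predictions):
    """predictions: iterable of (image_id, landmark_id or None, score or None)."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "landmarks"])  # assumed header
        for image_id, landmark_id, score in predictions:
            writer.writerow(format_row(image_id, landmark_id, score))

# Hypothetical usage with made-up ids:
# write_submission("submission.csv", [("a1", 8815, 0.03), ("b2", None, None)])
```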

Based on this file the submissions are evaluated using the Global Average Precision (GAP), also called micro Average Precision (microAP). The final ranking (in Kaggle terms: the "Leaderboard") is created on the basis of the GAP value. If you would like to know more about the GAP metric, click here.
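To make the metric more tangible, here is a minimal sketch of GAP as I understand its definition: all predictions are sorted by confidence, and GAP = (1/M) * sum over i of P(i) * rel(i), where P(i) is the precision at rank i, rel(i) is 1 if prediction i is correct and 0 otherwise, and M is the number of test images that actually show a landmark. The function name and the input layout are assumptions for this example, not Kaggle's evaluation code.

```python
def global_average_precision(predictions, num_queries_with_landmark):
    """predictions: list of (confidence, is_correct) pairs, one per predicted image.
    num_queries_with_landmark: M, the number of test images that show a landmark."""
    # Rank all predictions of the whole test set by confidence, highest first.
    ranked = sorted(predictions, key=lambda p: p[0], reverse=True)
    correct_so_far = 0
    total = 0.0
    for rank, (_, is_correct) in enumerate(ranked, start=1):
        if is_correct:
            correct_so_far += 1
            total += correct_so_far / rank  # precision at this rank
    if num_queries_with_landmark == 0:
        return 0.0
    return total / num_queries_with_landmark
```

A confident wrong prediction lowers the precision of every correct prediction ranked below it, which is why leaving out low-score predictions can pay off.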

Data Preprocessing

For this competition I downloaded the following two datasets:

  • train.csv: 1,225,029 csv rows with URLs to train images labeled with their associated landmark
  • test.csv: 117,703 csv rows with URLs to test images

As mentioned above, the test images may show no landmark, one landmark or more than one landmark. The training images each depict one landmark. Each row in the training set contains a unique id, the URL and the labeled landmark id.

Train dataset with id, URL and landmark_id

In the test dataset each image contains a unique id and the image URL.

Test dataset with id and URL

As you can see, I have to download the images before I can use them. In the first trial I downloaded the original images in full resolution (over 350 GB of data). That was too much for my machine, so I decided to resize them. I stored them in a resolution of 128 x 128 pixels, which reduced the size to 15 GB.
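A minimal sketch of the resizing step, assuming the Pillow library and images that are already downloaded to disk. The function name and the JPEG output format are choices made for this example. Note that resizing straight to 128 x 128 ignores the original aspect ratio.

```python
from pathlib import Path
from PIL import Image

def resize_image(src_path, dst_path, size=(128, 128)):
    """Load an image, convert it to RGB, and save a resized copy as JPEG."""
    with Image.open(src_path) as img:
        resized = img.convert("RGB").resize(size)
    # Create the target folder on demand.
    Path(dst_path).parent.mkdir(parents=True, exist_ok=True)
    resized.save(dst_path, format="JPEG")
```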

The train and test images are now available in the same size. This ends the preprocessing step for the test data. The training images, however, are not yet organized by label. Therefore I created a script which prepares a directory for each landmark class and assigns the related images to these directories. Now we have 15,000 folders, one for each label.
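Such a script could look roughly like the sketch below. It assumes that train.csv has the columns ‘id’ and ‘landmark_id’ and that every downloaded image is stored as ‘<id>.jpg’; the function name and folder layout are hypothetical.

```python
import csv
import shutil
from pathlib import Path

def sort_into_class_folders(train_csv, image_dir, output_dir, ext=".jpg"):
    """Copy each image into output_dir/<landmark_id>/ and return the number copied."""
    copied = 0
    with open(train_csv, newline="") as f:
        for row in csv.DictReader(f):
            src = Path(image_dir) / (row["id"] + ext)
            if not src.exists():  # skip images whose download failed
                continue
            class_dir = Path(output_dir) / row["landmark_id"]
            class_dir.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, class_dir / src.name)
            copied += 1
    return copied
```

One folder per class is also the layout that Keras' ‘flow_from_directory’ expects, so the folder names can serve directly as class labels.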

Modeling with Keras and Transfer Learning

For my landmark recognition model I decided to classify the images using a Convolutional Neural Network (in this case VGG16) pre-trained on the ImageNet dataset. The Keras library ships several CNN architectures besides VGG16, for example ResNet and Inception. These networks are trained to recognize 1,000 different categories of everyday things, such as breeds of dogs, cats, various household objects, vehicle types and so on, all of which are part of ImageNet. To classify landmarks, I apply transfer learning in the form of feature extraction and fine-tuning on top of the pre-trained VGG16. During training I froze the first 15 layers of the CNN, so only the remaining layers adapt their weights to images that are not part of ImageNet.

VGG16 architecture (Source: toronto.edu )

The network architecture of VGG16 is described in the paper of Simonyan and Zisserman in 2014 (paper).

The ’16’ means that the network has 16 weight layers (13 convolutional and 3 fully-connected). By the way, there’s also a VGG with 19 layers. Its name is VGG19, of course. The default input size for this model is 224 x 224. However, I changed the resolution to 128 x 128 pixels to match my downscaled images and to speed up training.

Bottleneck features

Now let’s jump into the code and see which steps I have to complete before training the model. First, I use the convolutional part of the model as a feature extractor. Its outputs are called ‘bottleneck features’ (i.e. the last activation maps before the fully-connected layers in the original model). After that I train a small fully-connected network on the bottleneck features, so we get the classes as outputs for our problem. The code for this step only determines the classes and saves the corresponding bottleneck features of the VGG16. Furthermore I use an ImageDataGenerator to rescale the images (full code: here).
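The idea can be sketched as follows. This is not the code from the repository: it assumes ‘tensorflow.keras’, a rescaling factor of 1/255, and images that are already loaded into a NumPy array instead of being streamed by a generator. The function names are made up.

```python
import numpy as np
from tensorflow.keras.applications import VGG16

def build_conv_base(weights="imagenet", input_shape=(128, 128, 3)):
    """VGG16 without its fully-connected head; outputs the last pooling activations."""
    return VGG16(include_top=False, weights=weights, input_shape=input_shape)

def extract_bottleneck_features(conv_base, images):
    """images: float array of shape (n, 128, 128, 3) with pixel values in [0, 255]."""
    scaled = images / 255.0  # assumed rescaling factor
    return conv_base.predict(scaled, verbose=0)

# Hypothetical usage:
# features = extract_bottleneck_features(build_conv_base(), train_images)
# np.save("bottleneck_features_train.npy", features)
```

For 128 x 128 inputs the extractor returns arrays of shape (n, 4, 4, 512), because the five pooling layers of VGG16 halve the resolution five times.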

Train the top model

With the bottleneck features saved, I am ready to train the top model. I define a function for that, called ‘train_top_model()’, which creates a small fully-connected network with the bottleneck features as input. For fine-tuning to work, you shouldn’t put a randomly initialized fully-connected network on top of a pre-trained CNN, because the large gradient updates caused by the random weights would destroy the learned weights in the convolutional base. Consequently, fine-tuning can only start once the top-level classifier has been trained.
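A small top model of this kind might look like the sketch below. The hidden layer size, the dropout rate, the optimizer and the loss are assumptions for illustration, and ‘build_top_model’ is a hypothetical helper, not the ‘train_top_model()’ function from the repository.

```python
from tensorflow.keras import layers, models

def build_top_model(input_shape=(4, 4, 512), num_classes=15000):
    """Small fully-connected classifier that takes bottleneck features as input."""
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Flatten(),
        layers.Dense(256, activation="relu"),   # assumed hidden size
        layers.Dropout(0.5),                    # assumed dropout rate
        layers.Dense(num_classes, activation="softmax"),
    ])
    # categorical_crossentropy expects one-hot encoded labels.
    model.compile(optimizer="rmsprop",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Training then only involves this small network, for example with ‘model.fit(features, one_hot_labels, ...)’, which is much cheaper than training the full VGG16.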

Fine-tuning

This model is not yet performing well, because the convolutional weights are still those learned on ImageNet, so we have to fine-tune it. Alongside the top-level classifier we fine-tune the last two convolutional parts and freeze the first 15 layers. First we build the convolutional base of the VGG16 and load its pre-trained weights. After that we load the weights of the previously trained top model and freeze the layers of the model up to layer 15. Additionally we use data augmentation for the training images (detailed information in Hauke’s Blog Post: here). A ‘fit_generator’ feeds the data into RAM in batches. Without it you will run into an out-of-memory error.
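The freezing step can be sketched like this, assuming ‘tensorflow.keras’ and an already trained top model. In the ‘VGG16’ model from ‘tensorflow.keras.applications’, slicing ‘layers[:15]’ covers everything up to the fourth pooling layer, so only the last convolutional block stays trainable in this sketch; the exact layer split in the original code may differ. The function name, the optimizer and its learning rate are assumptions.

```python
from tensorflow.keras import models, optimizers
from tensorflow.keras.applications import VGG16

def build_finetune_model(top_model, weights="imagenet", frozen_layers=15):
    """Stack a trained top model on VGG16 and freeze the first layers of the base."""
    conv_base = VGG16(include_top=False, weights=weights, input_shape=(128, 128, 3))
    for layer in conv_base.layers[:frozen_layers]:
        layer.trainable = False  # keep the generic ImageNet features untouched
    outputs = top_model(conv_base.output)
    model = models.Model(inputs=conv_base.input, outputs=outputs)
    # A small learning rate keeps the updates to the pre-trained weights gentle.
    model.compile(optimizer=optimizers.SGD(learning_rate=1e-4, momentum=0.9),
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

In recent Keras versions ‘model.fit’ accepts generators directly, so a separate ‘fit_generator’ call is no longer needed there.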

Training

This part focuses on the usage of a GPU in this context. Training a CNN gets a big speed boost from a GPU, or even more from multiple GPUs. So for this Landmark Recognition Challenge I decided to use an AWS EC2 instance, which comes with one Tesla K80 GPU. In the end I got a validation accuracy of nearly 80 percent and a loss of about 0.9, which is a big success. Even with a GPU, the training takes several days. It’s nothing new that training a VGG16 is very challenging because of its depth. When you combine this with a lot of data, the training duration increases enormously. So imagine how tedious it would be to wait until that network is trained on a CPU 😀

Training of the CNN – terminal view

Prediction

Now we are ready to pass our test images through the network. For each of the 117,703 test images I have to predict a landmark class from 1 to 15,000 and the related prediction score for this classification. Then I put everything together in a submission file, as already mentioned under the header ‘The Landmark Recognition Challenge’.

Which landmark?

It is important to note that this prediction step ignores the possibility that an image shows no landmark at all. First I predict the class label that is most likely the right one. Whether the image contains a landmark at all is decided in the next prediction step. The prediction of the landmark label works as follows.

In order to predict the landmark, we need to run the test image through the same pipeline as before. I load the trained model, and its predict method returns the predictions of the Convolutional Neural Network. Subsequently I decode the predictions and map them to one of the classes saved in a map. And keep in mind to clear the Keras session after each run!
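The decoding step can be sketched with NumPy alone. The sketch assumes that the class map has the form that Keras' ‘flow_from_directory’ exposes as ‘class_indices’ (folder name mapped to output index); ‘decode_predictions’ is a made-up helper name.

```python
import numpy as np

def decode_predictions(probabilities, class_indices):
    """probabilities: array (n_images, n_classes) of softmax outputs.
    class_indices: dict mapping landmark id (folder name) to output index."""
    index_to_label = {index: label for label, index in class_indices.items()}
    best = np.argmax(probabilities, axis=1)             # most likely class per image
    scores = probabilities[np.arange(len(best)), best]  # its prediction score
    return [(index_to_label[int(i)], float(s)) for i, s in zip(best, scores)]
```

Each returned pair holds the predicted landmark id and its score, which is what a submission row needs. Clearing the session afterwards can be done with ‘keras.backend.clear_session()’.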

Is the prediction correct?

How do we decide for or against the predicted landmark? The answer is called DELF (DEep Local Features). As the name suggests, it is a method to extract local features from images and compare them. In our case DELF helps to match two images containing the same landmark and to obtain local image correspondences. So it fits landmark recognition very well. DELF was newly developed and introduced in this paper.

DELF architecture (Source: DELF-paper)

The architecture includes a mechanism that is trained to select the features with the highest scores (yellow). On the right side, the DELF pipeline is used to find matches between a query image and some database images. The index supports querying by retrieving nearest neighbor (NN) features. In addition, the image correspondences are established from geometrically verified matches.

For example, the image below visualizes the feature correspondences between two images of the same landmark. In this case the ‘landmark’ is our NovaTec head office in Leinfelden-Echterdingen.

DELF matches – head office NovaTec

Implementation of DELF

It takes three steps to identify a predicted landmark in a test image. You can find them in my repository.

  1. Extract features: DELF extraction from an image list
  2. Find the matches: Comparison of features to get the matches
  3. Decide the result: Decision on how many correspondences (inliers) are needed to confirm the landmark

After creating the DELF features, the only thing left to do is to find the matches between these features. As mentioned, we do this with geometric verification using RANSAC. The number of returned ‘inliers’ measures the DELF correspondences. I set the threshold to a fixed value of 35. It means that every test image which returns more than 35 inliers is considered to contain the predicted landmark.

Putting it all together, we get the pipeline shown below.

Full landmark recognition pipeline

We start by classifying the test image into one landmark class. We then check the predicted class with DELF by comparing the test image against training images taken from the folder of that landmark. As soon as the inlier threshold (35) is exceeded, or after 20 comparisons have run through, a result is returned. The result is either a verification of the classified landmark or ‘no landmark’.
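The decision logic of this last step can be sketched in a few lines. The callable ‘count_inliers’ is a hypothetical stand-in for the DELF feature matching with RANSAC verification; it is not part of any library, and the function name ‘verify_landmark’ is made up as well.

```python
def verify_landmark(count_inliers, test_image, reference_images,
                    inlier_threshold=35, max_comparisons=20):
    """Return True if the predicted landmark is confirmed, False for 'no landmark'.

    count_inliers(test_image, reference_image) -> int is a hypothetical callable
    that wraps DELF matching and geometric verification.
    """
    for reference in reference_images[:max_comparisons]:
        if count_inliers(test_image, reference) > inlier_threshold:
            return True  # stop at the first sufficiently good match
    return False  # no reference image exceeded the threshold
```

Stopping at the first match matters for the runtime, because every comparison involves feature matching and RANSAC, and this check runs for each of the test images.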

Lessons Learned

During the development of my landmark pipeline and my participation in the Kaggle Landmark Recognition Challenge I learned a few things.

The first point concerns the hardware. It is important to have enough storage available and a powerful processor. You’re dealing with a lot of data, which means you have to store it somewhere. In my case I only had 50 GB of disk space on my virtual machine, so I decided to reduce the resolution of my images. But with a higher resolution the landmark recognizer might work better. One training run takes about three days on one GPU. If you have more GPUs available, you could distribute the training across them. Your training process will be faster and you can iterate more often to optimize your application.

A Kaggle Challenge runs for about three months. You should use the full three months to optimize your machine learning result. You will be at a big disadvantage if you decide to join a challenge halfway through. The other participants will always be one step ahead.

Don’t focus too much on a single model. If you try more than one solution, you can pick the best one or combine multiple techniques. During my participation I focused on only one solution and tried to optimize it. In hindsight, I think it would have been better to also try other strategies and techniques.

Communicate with other Kaggle participants or join a team to share ideas. Besides new ideas, another advantage is the availability of more computing capacity. You can split the test dataset and deliver the results more quickly.

Conclusion

As you can see, I learned a lot of new things in my first Kaggle Challenge. I hope you will apply these suggestions in your Challenge, too. Kaggle Challenges are a great way to get into the subject of Machine Learning and to earn a bit of prize money, if you do well.

An overview of my code is here. If there are any questions, I would like to hear from you. Otherwise, I can only wish you a lot of fun in your next Kaggle Challenge!

References

“Large-Scale Image Retrieval with Attentive Deep Local Features”, Hyeonwoo Noh, Andre Araujo, Jack Sim, Tobias Weyand, Bohyung Han, Proc. ICCV’17
