13. January 2020
5 min

Capsule Networks - Better CNNs?

Convolutional Neural Nets are state of the art in image classification, but what are Capsule Networks and why do they might perform better than CNNs?

The pooling operation used in convolutional neural networks is a big mistake and the fact that it works so well is a disaster.

This is a quote from Geoffrey Hinton, considered to be one of the most import researchers on AI and Deep Learning. In 2012, his student Alex Krizhevsky won the ImageNet challenge with the Convolutional Neural Network (CNN) AlexNet and layed ground for the success of CNNs in Computer Vision. So why does Hinton think that CNNs are bad?

What CNNs are bad at:

If you want to recap CNNs, I can recommend this article or take a look at this blogpost for a CNNs used for image classification.
Although CNNs achieve the best results in image classification, they are actually pretty bad in detecting objects in different poses. To achieve good performance a lot of training data is needed, and techniques like data augmentation are used.
This is because CNNs are not equivariant but invariant to pose changes. Equivariance means, that when the input changes, the output changes the same way. Invariance in contrast means, that when the input changes, the output doesn’t change. Because CNNs are made to be invariant, this makes them robust against translation changes. But other pose changes (like rotation, orientation) are not handled well.
This is because CNNs use pooling (e.g. max-pooling) to convey information from one layer to the next. CNNs detect certain features in an image, but through using a pooling layer valuable information gets lost.

Max Pooling Layer in CNN

Max Pooling operation

What do we want instead:

What we want instead is equivariance. When the input changes, the output should change accordingly.

Instantiation parameter

As the boat is rotated the vectors/capsules on the right hand side rotate as well. The orientation of the vector represents the parameters describing the object.

So how do Capsule Nets work?

If you are more hands on and want to immediately see an implementation, check out this jupyter notebook, in which I implemented a capsule network with pytorch.

In computer graphics, a picture is rendered from a lot of stored parameters. Capsule networks are going the other way round:  they infer instantiation parameters from an input image. So instead of only providing us a probability (as CNNs do), the output of a capsule network is a vector, Each capsule outputs a vector, where the length of the vector reflects the probability that an specific object is present, while the orientation encodes the instantiation parameters.

Activation of capsules

We have a triangle (blue) and a rectangle (red) capsule. We can see that activation (reflected by the length of the capsule vector) is higher where a rectangle and a triangle exits.

In the previous picture there were two different capsules, each is responsible for detecting a specific shape or object. Additionally capsule networks have a hierarchy: in lower levels simple shapes, whereas more complex objects are detected in higher levels.

Hiearchy of capsules

The boat and rectangle capsule report to the boat capsule. There could be many more low and high level capsules.

The instantiation parameters encoded in the output vector can be:

  • scale
  • skewness
  • size
  • color
  • orientation

Lets take a deeper look at Capsules

We have seen that capsule networks output vectors which store instantiation parameters.
Lets compare “classical” neurons and capsules:

The main differences here are

  • capsules work with vectors instead of scalar
  • they have an activation function called squashing

Squashing is a special kind of normalization, i.e. it rescales the vector to a value between 0 and 1 while preserving its orientation. Thereby short vectors get shrunk to almost zero length, while long vectors get rescaled to values slightly below 1.

Layout of a capsule

Another difference between CNN’s and Capsule Networks is the use of routing by agreement instead of pooling. Routing by agreement is the algorithm introduced in 2017 which made it possible to implement performant Capsule Networks.

How does routing by agreement work?

We have seen that there is a hierarchy: lower level capsules report to higher level capsules. Routing by Agreement determines which lower level capsule reports to which higher level capsule.
Lower level capsules are multiplied with a affine transformation matrix W and try to predict what higher level capsules will output. Those matrices W are calculated with classic backpropagation. Through routing by agreement, the coupling coefficients c~i~ are determined. As seen before, the coupling coefficients give us a weighted sum of the transformed input. The higher c~i~, the more important the input from the lower level capsule.

Routing by agreement

Both the triangle and rectangle capsule agree on predicting a boat. So the coupling coefficients to the boat capsule get increased, while the coefficients to the house capsule get decreased.

For more detail, we take a look a the routing algorithm from the paper and go through it, step-by-step:

Routing by Agreement Algorithm

Routing by Agreement Algorithm (Sabour et al., 2017)

First, all coefficients are initialized with 0. In each iteration we normalize the weights using softmax and then compute the weighted sum sj of all lower level capsules for a high level capsule j.
After normalizing the sum with the squash function, we compute the difference between the normalized weighted sum vj and each transformed lower level capsule input ûi. The more alike vj and ûi, i.e. the more ûiagrees with vj, the bigger the product. This is then added to the current coupling coefficient. So step by step, coupling coefficients where capsules agree get increased.

Here is an example implementation with pytorch (you can find the whole implementation in this jupyter notebook):

What makes routing by agreement special?

On the left hand side only a boat is detected, which leaves the remaining rectangle and triangle unexplained. But on the right hand side, all shapes can be matched with an object.

It is good at handling crowded scenes, by “explaining away” ambiguities. What does that mean?
In the image above, we can see either a house and a boat, or only a house with a triangle on top and a rectangle below.
As detecting both a house and a boat fits better to the predictions made by the lower level, the ambiguity is resolved.

Architecture and Implementation

If you are still interested in learning more about Capsule Networks and implementing it yourself, keep on reading!

This is the architecture used in the paper from Sabour et al.:

Architecture of the encoder part of a capsule network.

This is the encoder part. We start with a convolutional layer with 256 kernels, resulting in a 20x20x256 output. The next layer is called primary caps, but is acutally only another convolution. With the second caps layer we have a proper capsule layer as we have seen before. Each capsule gets transformed with a affine transformation matrix and then dynamic routing takes place, where it is determined which lower level capsule reports to which higher level capsule.
Afterwards the encoder part, a fully connected network is used as a decoder to reconstruct the original image from the instantiation parameters.

If you are interested in trying out a capsule network for yourself, check out this jupyter notebook.  There, I am explaining and implementing the architecture step-by-step with the FashionMNIST dataset and pytorch.

So are Capsule Networks the new CNNs?

No, not yet.
Capsule networks achieve comparable results on MNIST, while requiring less training data. They are robust against small affine transformations. Also they are good for detecting overlapping objects. This is promising for image segmentation.
But because a capsule network tries to “explain” everything in an image, they perform badly on more complex datasets. Additionally, Capsule Networks need longer time to be trained.
A possible use case is the integration of capsule layers into CNNs.

Also they are nice for visualization:

We can see how different instantiation parameters affect the output of the network. It is not always possible to map parameters to attributes, but we can see how changing the output parameters affects the reconstruction.

I recommend to check out the visualization yourself and tweak the parameters a bit, it’s fun!

Wanna learn more about capsules?

Those are nice explanations of Capsule Networks:

And this is the original paper:
Sara Sabour et al., 2017, https://arxiv.org/pdf/1710.09829.pdf