27. August 2018

Deep Learning for End-to-End Captcha Solving

Automation is great, and so I want to automate everything, even during load testing. However, one of the most annoying things that constantly comes up when creating load tests for webpages is captchas. They come in various shapes and forms, are annoying and are sometimes hard to solve. Yet their entire purpose is to differentiate humans from machines. This premise has repeatedly been shown not to hold against all kinds of approaches, with recent ones using Machine Learning to boost concepts from Computer Vision. After reading up on several approaches, like this one, we were motivated to look into solving a custom captcha that introduces additional problems for automated solving. In this post, we outline our approach and guide you through solving similar problems using Deep Learning.


Let’s quickly analyze what type of captcha we are dealing with: a wiggly 3D meshgrid with 4 characters embedded in it, presented to us as a 2D image.
The captchas we want to solve have several characteristics that we need to deal with:

  • Inclination of the captcha to the left or right
  • Distortion of numbers and letters
  • High and low contrast (distinction between character and background)
  • Background/noise cleanup
  • No easy character separation using contours

In addition, we didn’t have access to the captcha generator source code. So there isn’t an easy way of generating labeled samples. Nonetheless, patterns exist in the captchas we need to solve, and we’re certain that a Convolutional Neural Network (CNN) has the capability to find and use them.
Let’s get to work with Python!

Sample captcha we want to solve


The Python libraries we use are:

  • NumPy, for efficient (multidimensional) arrays
  • OpenCV2, for transforming our images
  • Pillow (a PIL variant), for handling images
  • Keras, for defining and training our model
  • Flask, for hosting the model for actual inference

Preparing the Captcha Dataset

Collecting, preparing and cleaning the data is probably the most time consuming part of this endeavour and should not be underestimated by anyone.

Data Acquisition

First things first, we have to acquire lots and lots of captchas in order to use them for training. In our case, we made repeated requests to the endpoint that provides the captcha images. However, this step is specific to your source of images: maybe a plain HTTP request suffices, or you need to scrape the webpage using something like Selenium.
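For the simple case where a plain HTTP request is enough, the download loop could be sketched as follows. This is only an illustration using the standard library: the function name, the file naming scheme and the delay between requests are assumptions, and the URL has to be replaced with your own image source.

```python
import time
import urllib.request
from pathlib import Path


def download_captchas(url: str, out_dir: str, count: int, delay: float = 0.5) -> list[Path]:
    """Fetch `count` images from `url` and store them as numbered files."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    saved = []
    for i in range(count):
        # Assumption: every request returns a freshly generated captcha image.
        with urllib.request.urlopen(url, timeout=10) as response:
            data = response.read()
        path = target / f"captcha_{i:05d}.png"
        path.write_bytes(data)
        saved.append(path)
        time.sleep(delay)  # be gentle with the server
    return saved
```

A call such as `download_captchas("https://example.org/captcha", "raw", count=100)` would then fill the `raw` directory; the URL here is a placeholder.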

After downloading 10,000 images, we quickly realized that labelling each of them individually is a very tedious task. After labelling 500, we stopped (resulting in 4 * 500 = 2,000 labelled characters). While this is a very low number of samples to start with in Deep Learning, later on we’ll explore a technique that mitigates the problem of having a small dataset in our case.


The first step of cleaning up our data is to extract images for each of the 4 individual characters from a captcha. During labelling we noticed that the placement of the meshgrid varies only slightly and has two main directions, left and right:

Furthermore, when stacking multiple images of the same direction, we can observe very little variation:

Several stacked left-oriented captchas

So, if we can determine the general direction of the captcha using simple linear regression, we can apply a fixed perspective transformation for each of the left and right orientations to obtain a normalized/canonical character sequence. For this, we refer to Adrian Rosebrock’s post on how to perform a four-point perspective transformation using OpenCV.
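The direction check can be sketched with NumPy alone. The idea shown here is to fit a straight line through the dark pixels and look at the sign of its slope. The threshold value, the assumption that the meshgrid is darker than the background, and the mapping of slope sign to "left" or "right" are all illustrative choices, not details taken from our pipeline.

```python
import numpy as np


def estimate_orientation(gray: np.ndarray, threshold: int = 128) -> tuple[str, float]:
    """Fit a line through the dark pixels and classify the captcha by slope sign."""
    # Coordinates of all pixels darker than the threshold (assumed foreground).
    ys, xs = np.nonzero(gray < threshold)
    if xs.size < 2 or xs.max() == xs.min():
        raise ValueError("not enough foreground pixels to fit a line")
    # Least-squares fit of y = slope * x + intercept.
    slope, _intercept = np.polyfit(xs, ys, deg=1)
    # Image rows grow downwards, so a negative slope means the mesh rises
    # towards the right. Calling that case "right" is a naming assumption.
    label = "right" if slope < 0 else "left"
    return label, float(slope)
```

The returned label would then select which of the two fixed perspective transformations to apply.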

At this point, we have essentially removed the third dimension and the rotation from the captchas. Therefore, we can make simple vertical cuts at 25%, 50% and 75% of the width to get images of the individual characters:
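A minimal sketch of these cuts, assuming the normalized captcha is available as a NumPy array with the width on the second axis (the function name is hypothetical):

```python
import numpy as np


def split_characters(captcha: np.ndarray, n_chars: int = 4) -> list[np.ndarray]:
    """Cut a normalized captcha into `n_chars` equally wide character images."""
    width = captcha.shape[1]
    # For n_chars = 4 the boundaries sit at 0%, 25%, 50%, 75% and 100% of the width.
    bounds = [round(i * width / n_chars) for i in range(n_chars + 1)]
    return [captcha[:, left:right] for left, right in zip(bounds[:-1], bounds[1:])]
```

Because the slices are plain array views, stacking them horizontally again reproduces the original image.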

Left: Original CAPTCHA Middle: CAPTCHA after four-point transformation Right: resulting characters

Notice how the individual characters are sometimes not perfect, with possible residuals of adjacent characters in the image? This does not pose a problem; it might even increase the robustness of learning and help prevent overfitting.


As we are dealing with images, we resort to Convolutional Neural Networks, which have proven to yield excellent results in many image tasks across simple and complex domains. Furthermore, our input space is limited to relatively basic synthesized captchas, which is why we aim for a network architecture whose capacity fits our domain. The network we are going to use looks like this:

Network Architecture

Network Architecture, image generated using https://github.com/lutzroeder/Netron

This is a fairly standard CNN with one fully-connected hidden layer at the end, followed by an output layer.
Keras makes it very easy to define such a model:
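A minimal sketch of such a definition, assuming 40x40 grayscale character crops as input. Only the overall shape (convolution and pooling layers, one fully-connected hidden layer, 22 output classes) follows this post; the input size, filter counts, kernel sizes and hidden layer width are illustrative assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers

NUM_CLASSES = 22           # number of distinct characters in our captchas
INPUT_SHAPE = (40, 40, 1)  # assumption: 40x40 grayscale character crops


def build_model() -> keras.Model:
    """Small CNN: conv/pool feature extractor, one hidden dense layer, softmax output."""
    return keras.Sequential([
        keras.Input(shape=INPUT_SHAPE),
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),              # fully-connected hidden layer
        layers.Dense(NUM_CLASSES, activation="softmax"),   # one probability per character
    ])
```

Calling `build_model().summary()` prints the layer shapes and parameter counts.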

We have an output size of only 22, not 36, because that is the actual number of distinct characters used in our captchas, probably to avoid characters that even humans could confuse. This is an advantage for our network, as it has fewer options to choose from.

All samples and their corresponding labels are serialized as NumPy arrays, which we feed into our network. Because the network needs numeric targets, we encode the character labels as numbers. For this, you can either roll your own solution or resort to something like scikit-learn’s LabelEncoder.
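Rolling your own encoding can be as small as the following sketch. The alphabet below is hypothetical: we only know that 22 distinct characters occur, so the concrete set, like the function names, is an assumption for illustration.

```python
import numpy as np

# Hypothetical alphabet of 22 characters; replace it with the characters
# that actually occur in your labelled data.
ALPHABET = "2345678ABCDEFHKMNPRWXY"
CHAR_TO_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def encode_labels(chars: str) -> np.ndarray:
    """Map each character to its class index and return one-hot rows."""
    indices = np.array([CHAR_TO_INDEX[ch] for ch in chars])
    return np.eye(len(ALPHABET), dtype=np.float32)[indices]


def decode_labels(one_hot: np.ndarray) -> str:
    """Inverse of encode_labels: take the arg-max class of every row."""
    return "".join(ALPHABET[i] for i in one_hot.argmax(axis=1))
```

Since `decode_labels` only takes the arg-max of each row, it works equally well on the softmax output of a network.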

Data Augmentation

Previously, we noticed that our 2,000 character samples are not a lot of data. For most scenarios, a dataset this small will not get you good results from neural networks. But looking at the nature of our input, we see that we are dealing with synthetic images. After all, this is a relatively simple input space in which the same character varies mostly in position, rotation and zoom factor. These are all variations we can replicate automatically from the labelled samples we have, and thus effectively increase the size of our dataset without having to label everything manually. Enlarging the dataset like this is called Data Augmentation. It generally also helps with the robustness of learning, as we force the network to handle rotated and twisted features better, which in turn helps prevent overfitting. Keras provides convenient methods for this, as we can simply use the built-in ImageDataGenerator to generate images on the fly:

For more information on the topic, or if you want to perform the augmentation on your own, check out this post.
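To give an idea of what a hand-rolled augmentation could look like, here is a minimal NumPy sketch that only covers random position shifts. Rotation and zoom are left out to keep it short, and the function names and the default shift of 3 pixels are assumptions.

```python
import numpy as np


def random_shift(image: np.ndarray, max_shift: int, rng: np.random.Generator) -> np.ndarray:
    """Translate a 2D image by a random offset, filling uncovered pixels with edge values."""
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    padded = np.pad(image, max_shift, mode="edge")
    top, left = max_shift - dy, max_shift - dx
    return padded[top:top + image.shape[0], left:left + image.shape[1]]


def augment(images: np.ndarray, labels: np.ndarray, copies: int,
            max_shift: int = 3, seed: int = 0) -> tuple[np.ndarray, np.ndarray]:
    """Return the original samples plus `copies` randomly shifted variants of each."""
    rng = np.random.default_rng(seed)
    out_x, out_y = [images], [labels]
    for _ in range(copies):
        out_x.append(np.stack([random_shift(img, max_shift, rng) for img in images]))
        out_y.append(labels)  # a shifted character keeps its label
    return np.concatenate(out_x), np.concatenate(out_y)
```

With `copies=4`, for example, 2,000 labelled characters turn into 10,000 training samples.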

Training the Model

Finally, let’s compile the model and run it:
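A sketch of what compiling and fitting could look like for one-hot encoded labels. The optimizer, learning rate, number of epochs and batch size shown here are placeholder assumptions, not the values behind the results in this post.

```python
from tensorflow import keras


def train(model: keras.Model, x_train, y_train, x_val, y_val,
          epochs: int = 30, batch_size: int = 32):
    """Compile the model for multi-class classification and fit it."""
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=1e-3),  # assumed optimizer
        loss="categorical_crossentropy",                      # expects one-hot labels
        metrics=["accuracy"],
    )
    # The returned History object holds the loss and accuracy per epoch.
    return model.fit(
        x_train, y_train,
        validation_data=(x_val, y_val),
        epochs=epochs,
        batch_size=batch_size,
        verbose=0,
    )
```

The `history.history` dictionary of the result can be plotted to check whether the validation accuracy still improves or the model starts to overfit.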

The hyperparameters are up to you to tune, maybe by doing an exhaustive sweep across the parameter space or by using something like Bayesian optimization.


At this point, we can evaluate the performance using our test set. With an average per-character accuracy of 95 percent, it’s a very satisfying result already! Assuming the four characters are classified independently, a whole captcha is solved correctly 0.95^4 ≈ 81 percent of the time. So 4 out of 5, not bad at all. Additional analysis shows that the network sometimes confuses character pairs with similar structures, such as M and W or K and X.
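The arithmetic behind the captcha-level figure, written out as a small sketch. It assumes that errors on the four characters are independent, which is a simplification; the function names are ours.

```python
def captcha_accuracy(char_accuracy: float, n_chars: int = 4) -> float:
    """Probability that all characters are right, assuming independent errors."""
    return char_accuracy ** n_chars


def required_char_accuracy(target: float, n_chars: int = 4) -> float:
    """Per-character accuracy needed to reach a target whole-captcha accuracy."""
    return target ** (1.0 / n_chars)


print(f"{captcha_accuracy(0.95):.4f}")        # 0.8145
print(f"{required_char_accuracy(0.95):.4f}")  # 0.9873
```

The second number shows how steep the requirement is: solving 95 percent of whole captchas would need close to 99 percent accuracy per character.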

Visualizing Activations

One interesting thing about CNNs is that we can easily visualize what they actually learn! The convolutional filters used in every layer have their roots in computer vision, which is where you can learn more about them. Each layer consists of channels, and each channel learns a different filter that it applies to the image, which we can visualize for a given input. We do this using the code provided in fchollet’s (Keras creator) excellent series of Jupyter notebooks for Deep Learning with Python.

Visualized Activations of the first two convolutional layers and the first max pooling layer for an input image with a ‘4’

Upon feeding the network an example image of a ‘4’, the activations visualize the inner workings of the network in a surprisingly intuitive way. Early layers respond to small, local details such as edges, while deeper layers combine them into more abstract features.
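The core of this technique is a second model that shares the trained layers but returns their intermediate outputs. The following sketch shows the idea on a small stand-in CNN with random weights and a random input image; the layer sizes and the 40x40 input are assumptions, and plotting the channels (for example with matplotlib) is left out.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers


def activation_model(model: keras.Model, n_layers: int) -> keras.Model:
    """Build a model that returns the outputs of the first `n_layers` layers."""
    outputs = [layer.output for layer in model.layers[:n_layers]]
    return keras.Model(inputs=model.inputs, outputs=outputs)


# Stand-in network: two convolutions followed by max pooling.
cnn = keras.Sequential([
    keras.Input(shape=(40, 40, 1)),
    layers.Conv2D(8, (3, 3), activation="relu"),
    layers.Conv2D(8, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(22, activation="softmax"),
])

image = np.random.rand(1, 40, 40, 1).astype("float32")
activations = activation_model(cnn, n_layers=3).predict(image, verbose=0)
for act in activations:
    print(act.shape)  # (1, 38, 38, 8), then (1, 36, 36, 8), then (1, 18, 18, 8)
```

Each of the 8 channels in an activation array is one 2D feature map that can be shown as a grayscale image.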

Using the Model for Inference

Prediction Pipeline


Most tutorials stop here. They train a model, evaluate it and leave you with the scores on the test set.
But we are actually putting this model to the test now. Let’s use the model in a small Flask server application that can be called to solve incoming captchas.
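A minimal sketch of such a server. The route name `/solve`, the JSON response format and the `solve_captcha` placeholder are assumptions for illustration; in a real setup the placeholder would run the full pipeline of orientation detection, perspective transformation, character splitting and model prediction.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def solve_captcha(image_bytes: bytes) -> str:
    """Hypothetical placeholder for the real prediction pipeline."""
    return "????"


@app.route("/solve", methods=["POST"])
def solve():
    # The captcha image is expected as the raw request body.
    image_bytes = request.get_data()
    if not image_bytes:
        return jsonify(error="empty request body"), 400
    return jsonify(solution=solve_captcha(image_bytes))

# Start the development server with, for example:
#   flask --app <module name> run --port 5000
```

Loading the trained model once at startup, rather than per request, keeps the response time low.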

With a running server, we can make a POST request:
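On the client side, such a request could be sketched with the standard library as follows. The endpoint URL, the raw-bytes request body and the `solution` field in the JSON response are assumptions about the server interface.

```python
import json
import urllib.request


def build_request(server_url: str, image_bytes: bytes) -> urllib.request.Request:
    """Wrap the raw image bytes in a POST request."""
    return urllib.request.Request(
        server_url,
        data=image_bytes,
        headers={"Content-Type": "application/octet-stream"},
        method="POST",
    )


def solve_remote(server_url: str, image_path: str) -> str:
    """Send a captcha image file to the server and return the predicted text."""
    with open(image_path, "rb") as fh:
        req = build_request(server_url, fh.read())
    with urllib.request.urlopen(req, timeout=10) as response:
        return json.loads(response.read())["solution"]
```

A load test script would then call something like `solve_remote("http://localhost:5000/solve", "captcha.png")` before submitting the form.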


What did we want to achieve? We wanted to solve sophisticated captchas automatically with a deep learning approach in order to use this in the daily work of a performance engineer. We did!

This article summarizes what is needed to solve sophisticated captchas automatically and which steps are necessary to preprocess the captchas before a model can be trained. Furthermore, it shows how a Convolutional Neural Network is built and how a captcha-solving service can be integrated into a load test or synthetic monitoring.
Our tests have shown that the success rate in solving a captcha automatically is about 81%. So 4 out of 5, which is a sensational result. However, in our opinion, this is not yet sufficient for use in a load test. For synthetic monitoring, though, it is very suitable!

Questions? Thoughts? Want to stay up to date on the topic? We would love to hear from you. Tweet us at @novatecgmbh or email us at apm@novatec-gmbh.de
