Nowadays automation is used heavily to optimize processes and ease labour-intensive tasks. One task that is not necessarily labour-intensive, but that still has to be completed with humans in the loop, is meter reading, especially for the old electricity meters present in most private households. Tenants and owners can submit their meter readings on a regular basis, for example by mailing in a postcard, sending a photograph by email or entering the values in a web portal; the energy provider then reviews them to determine the amount of electricity used. Here we focus on submission by photograph, especially one taken with a smartphone. In the days of big data, document scans and photographs in general differ from a regular database row in that they lack a predefined structure. The information to be extracted is hidden in the pixels, captured by different cameras, surrounded by noise and taken in varying lighting conditions. For convenience, (semi-)automated meter reading is desirable even for older meters without digital interfaces. This demand is reflected in the software offerings of several companies that specialize in domain-specific OCR on edge devices, for example for ID cards, driver's licenses, passports or, as in our example, electricity and water meters.
In this post, we lay out the difficulties of meter reading, propose a domain-specific solution that works without a large dataset and discuss more advanced solutions. This can be considered an exercise in analyzing a problem and its possible solutions, weighing advantages and disadvantages and finally deciding on a viable strategy. We will use Python and leverage the image processing capabilities of OpenCV to implement a prototype.
Let’s take a look at a sample image of an electricity meter.
To extract the values displayed on the meter, the 3 main problems at hand are as follows:
- Localize the digit bar
- Localize every digit displayed
- Classify each digit (with labels 0-9)
Of course, humans master each of these tasks with ease, but all a computer is given are the pixel values of the image.
The nature of digital images brings another set of difficulties to the table:
- camera angle
- lighting conditions
The first task, localizing the digit bar, can be accomplished by a learning algorithm, such as a neural network for object detection/localization. These networks are trained on examples that are annotated with bounding boxes covering the object. Depending on the architecture, this approach can require a lot of samples as well as training time. As is often the practice with image training data, the dataset can be enriched by augmenting existing samples, that is, applying operations such as rotation, zoom and scaling in order to generate more samples for the network to learn from.
For simplicity, this prototype resorts to a very convenient assumption that a lot of ‘camera scanning’ apps such as mobile barcode readers make: the user places the object inside a predefined bounding box shown on screen.
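Under this assumption, localizing the digit bar reduces to cropping a fixed region from the camera frame. The sketch below assumes a centred guide box; the relative width and height are hypothetical values that would depend on the app's overlay.

```python
import numpy as np


def centered_guide_box(frame_shape, rel_width=0.8, rel_height=0.2):
    """Return (x, y, w, h) of a guide box centred in the frame.

    rel_width and rel_height are hypothetical fractions of the frame size.
    """
    h, w = frame_shape[:2]
    box_w, box_h = int(w * rel_width), int(h * rel_height)
    x, y = (w - box_w) // 2, (h - box_h) // 2
    return x, y, box_w, box_h


def crop_to_box(frame, box):
    """Cut the region described by (x, y, w, h) out of the frame."""
    x, y, w, h = box
    return frame[y:y + h, x:x + w]
```

For a 640x480 frame, the default box is 512 pixels wide and 96 pixels high.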
Now that we have sorted out the problem of localizing the digit bar, we want to automatically extract each and every digit for classification. This is a seemingly trivial task for humans, but one that needs some thought to be accomplished reliably by a computer. In the following we will use OpenCV in Python to preprocess our input image, a very important step that aids not only digit recognition but many computer vision tasks in general. As preprocessing occurs repeatedly, it is useful to write convenience methods and classes for it. OpenCV offers bindings for several languages (including mobile support); we use Python for its simplicity and its suitability for quick prototyping. Our preprocessing goal is to segment the image in such a way that digit structures become apparent and isolated, so that we can extract them from the image. All following steps assume a greyscale image.
Applying filters to images is a common procedure in most computer vision tasks. Usually, filters are applied by convolving the image with a so-called kernel or filter. This convolution operation is also highly relevant in state-of-the-art deep learning approaches, as it is the central technique of Convolutional Neural Networks. Certainly the most famous example of a kernel is the Gaussian kernel, which causes a blurring effect when applied to an image. Blurring causes details to vanish, because it removes high-frequency information. In that sense, a slight blur removes small noise artifacts from our images, which helps with the robustness of our approach. Another important filter is the Sobel operator, which approximates the derivative of the image. This means that at locations with a high rate of change, the derivative will be high. The effect is that edges are emphasized by high values, while flat areas will have a gradient closer to 0. Working on the image derivative has the advantage of being more robust with respect to the global illumination of the image, which means the derivative of an overall lighter image will be comparable to that of an image taken in a darker setting.
Both filters mentioned are part of a classic computer vision algorithm, the Canny edge detection algorithm. In contrast to using just the Sobel filter, Canny produces sharper, more concise edges. Here we face the problem of finding good parameters that reliably ‘edge’ out all parts of the digits, as illustrated below.
When taking the derivative does not give good results, the morphological operators dilation and erosion, and their combinations such as opening, closing and the top-hat transform, can be used to further remove (bigger) artifacts that blurring could not remove.
Finally, a thresholding operator can be applied to the image to clearly set apart the digits from the background. Values above a certain threshold are set to 1, while the rest, and thus hopefully the entire background, is set to 0.
The presented operations are usually chained, and their order depends on the input images and the domain one is dealing with. For example, after cleaning up the image using morphological operators, one can additionally take the derivative.
These steps are subject to experimentation by humans. Finding the best combination of filters, parameters and order of application is something that Machine Learning algorithms, especially Convolutional Neural Networks for images, can (at least partially) accomplish for us by learning from data samples. But even if a dataset is available, prototyping with filters and other operations can provide useful insights into the problem domain, give one a feel for the problem and serve as a baseline for a fully-fledged ML approach to compete with.
With the groundwork laid, the next step is to isolate the individual digits, or at least candidates for them. For this, we leverage OpenCV's findContours function, which returns the contours it detects in an input image. To proceed, we retrieve each contour's bounding box and keep only those boxes that are plausible for digits. To accomplish this, we use heuristics that check whether the aspect ratio and size of the bounding box are plausible. This discards bounding boxes of very small artifacts or overly large flat regions. In the end we hopefully obtain well-fitting bounding boxes for each individual digit.
Now that we have extracted the digit candidates by cropping the contents of the bounding boxes from the unprocessed original image, we wish to label each of them with the digit it most likely represents. In our case, we have very little data at hand, as we only have a few samples of meter readings. This makes most machine learning models infeasible and thus limits our options. Therefore, for this prototype we resort to a fairly static technique called template matching.
Assuming one template per digit has been prepared, we slide the templates over the isolated digits and measure their similarity using the Euclidean distance.
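The sliding comparison, together with picking the closest template as the prediction, can be sketched in plain NumPy. The function name and the dictionary of templates are hypothetical; the sketch assumes every template is no larger than the digit crop.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view  # NumPy >= 1.20


def classify_digit(digit, templates):
    """Return (label, distance) of the best matching template.

    digit:     2-D greyscale crop of one digit candidate
    templates: dict mapping a label to a 2-D template that is
               no larger than `digit` in either dimension
    """
    digit = digit.astype(np.float32)
    best_label, best_dist = None, float("inf")
    for label, template in templates.items():
        # All template-sized windows: shape (rows, cols, th, tw).
        windows = sliding_window_view(digit, template.shape)
        diff = windows - template.astype(np.float32)
        # Euclidean distance between the template and every window.
        dists = np.sqrt((diff ** 2).sum(axis=(2, 3)))
        dist = float(dists.min())
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label, best_dist
```

OpenCV's `cv2.matchTemplate` with the `cv2.TM_SQDIFF` method computes the squared version of this distance for every offset and is the faster choice for larger images.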
We then take the closest-fitting template and use its label as the prediction. Viewed from a machine learning angle, this approach is comparable to a nearest neighbour classification that uses only the single nearest neighbour to predict the class label, with 10 samples (0-9) available. While the process works pretty well in some cases, it can fail to isolate digits or misclassify them in others.
This has several reasons. For one, having only one template per digit does not make the classification robust enough. If classification were done using a neural network, the training data would usually show enough variety in angles, rotations, sizes and lighting conditions for the learned classifier to be more stable. In addition, techniques applied during the training of neural networks, such as data augmentation, can help with the real-world accuracy of the network. Also, comparing two images directly on their pixels using something like the Euclidean distance is brittle and prone to noise and small variations in the images.
Again, neural networks could aid this process: the internal representation that a network such as a CNN computes for an input image can serve as a fixed-length feature vector (the CNN acts as a feature extractor). This can be accomplished either by using a general-purpose pre-trained network such as Inception or ResNet and retrieving the internal representation from the last layer before the fully connected layer that performs the actual classification, or by using unsupervised methods such as training a convolutional autoencoder or applying Principal Component Analysis to the dataset. A feature vector encodes (semantic) features of the data in a relatively low dimension, which leads to better semantic comparability of inputs and is less sensitive to small and possibly noisy variations in the input.
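Of these options, Principal Component Analysis is the easiest to sketch without a trained network. The following NumPy version assumes a stack of equally sized greyscale digit crops; the function names and the number of components are illustrative.

```python
import numpy as np


def fit_pca(samples, n_components):
    """Fit PCA on a stack of images with shape (n, height, width).

    Returns the mean vector and the top principal directions (as rows).
    n_components must not exceed min(n, height * width).
    """
    flat = samples.reshape(len(samples), -1).astype(np.float64)
    mean = flat.mean(axis=0)
    # The right singular vectors of the centred data are the
    # principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(flat - mean, full_matrices=False)
    return mean, vt[:n_components]


def to_feature_vector(image, mean, components):
    """Project one image onto the principal directions."""
    return components @ (image.reshape(-1).astype(np.float64) - mean)
```

Distances between such feature vectors could then replace the raw pixel distances in the nearest neighbour step.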
The feasibility of neural networks on mobile and edge devices is repeatedly demonstrated by big smartphone manufacturers, such as Apple, which uses special AI hardware and provides the CoreML framework for ML-powered apps, and Google, which continuously releases updates for TensorFlow Lite for mobile and embedded systems. This means that meter reading can be performed directly on the device, without any internet connection.
We demonstrated an approach that can aid in automated meter reading using very little data, by replacing the ‘learning’ aspect with engineered features and assumptions about the problem at hand. We separated the task, identified problems and proposed appropriate solutions. With a reasonably large dataset, different learning-based approaches can be used, which come with the promise of being more stable, robust and generally more suitable for production.