Automate Machine Learning process with Neo4j stored procedures
If you want to learn more about Machine Learning and how to use it, there are many great tutorials across the web. Most of them use a scripting language like Python and simple CSV files as the store for the data sets. Integrating such a machine learning process into a bigger distributed application can become challenging. Preprocessing the data, retraining the model on new data and updating the model take a considerable amount of time. If the application constantly requests predictions from the model, the retraining process introduces latency. In this post, I would like to share the solution I designed to optimize both the machine learning process and the application. More precisely, I will show you how I automated the Machine Learning process of my specific project.
The situation before
The goal is to monitor the status of a production database. An example use case is monitoring the growth of a critical database table. To achieve this, we create statistical reports on the table counts of that database. We want to see how the number grew in the past, and we also want to predict how it will develop in the future. For that we use a Machine Learning process that makes these predictions.
The following image shows how I visualized the statistics with the web application “DB-Monitor”.
Workflow of the old application (called “DB-Monitor”)
The production database is the one we want to monitor. We periodically generate the relevant key figures and send them to the Neo4j database, which we use as a metadata store (1). Neo4j places great value on the relations between data and provides fast and intuitive data access. That makes it well suited for storing and analyzing metadata.
DB-Monitor imports this data directly via the BOLT protocol (2). This is a binary protocol designed for fast access to Neo4j from client applications.
The app creates a histogram of this data and provides it to the user (3).
For computing the predictions on this data, a more complex workflow is necessary.
The first step is to analyze the existing data and find regularities in it. For that I have to manually export the data to a CSV file and then run Machine Learning scripts on it (4). Once the scripts finish, we push the resulting model parameters to DB-Monitor so that it can use the model to predict the future value of a specific object based on the present data (5). Finally, the application visualizes the predictions in a histogram along the timeline.
The workflow requires considerable manual effort to feed the machine learning scripts and to push the updated machine learning model into the application. My goal was to automate this workflow and improve the performance of the machine learning component.
The idea is to relocate the machine learning process into the Neo4j instance, where it can consume the data directly and periodically generate a fresh model. The web application can then obtain predictions by accessing the model in Neo4j directly via the BOLT protocol.
This concept has the following advantages:
Consistent data and no unnecessary additional exports
One fast access point to the database
The prediction reflects newly added data
Definition and execution of Machine Learning logic solely in the database and not scattered around the whole system
Manual operations are completely automated
Neo4j offers the possibility to define stored procedures. This is great because stored procedures let us implement much more complex functionality than the Neo4j-specific query language Cypher alone. We write the procedures in Java, build them with Maven and simply copy the JAR file to the plugins folder of our Neo4j install directory. After restarting the database (note that this can be critical in certain production environments) we can call the stored procedures via Cypher from every data access point.
The idea was to write my own procedures in Java that can run Machine Learning algorithms. Luckily I found a work-in-progress project called Neo4j Machine Learning Procedures. The idea of this project is to build a bridge between Neo4j and different machine learning frameworks and libraries. Users should be able to use their framework of choice and make predictions directly on the data.
Unfortunately, the project is still a work in progress. So far you can only choose between the frameworks Encog and Dl4j, and currently they only work for some classification problems. The architecture, on the other hand, is well thought out in my opinion. I therefore decided to use it and write an implementation for my specific use case. With regard to performance, I think it’s always a good idea to consider using a specific solution over a general one.
Understanding the stored procedures
There are many different Machine Learning techniques that you can use and modify to solve your individual problems. Almost all of these techniques are based on the idea of training followed by prediction, and they differ only in how these two steps are implemented. The Neo4j project takes advantage of this fact. It provides a user-defined stored procedure architecture that lets developers write their own implementation with their favorite framework. This way we can implement the machine learning process in the form of Neo4j stored procedures.
The architecture is very simple and consists of two classes plus the aforementioned implementation classes. The class ML implements the stored procedures, which in turn call methods of the abstract class MLModel. The framework-specific implementation classes extend MLModel and contain the Machine Learning logic.
The architecture of the stored procedures project
The general idea is that we can create a Machine Learning model for a specific use case and use it to solve our problem. It is possible to have several different models at the same time and work with them in parallel.
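To make this layering more tangible, here is a minimal plain-Java sketch. It does not use the Neo4j procedure API or the project's real classes: the names `ModelRegistry`, `SketchModel` and `MeanModel` are hypothetical stand-ins for ML, MLModel and a framework-specific implementation, and the "learning" is reduced to computing a mean.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the abstract MLModel class: it holds the shared
// bookkeeping and leaves the actual learning logic to subclasses.
abstract class SketchModel {
    protected final List<double[]> rows = new ArrayList<>(); // each row: {input, output}

    void add(double input, double output) {
        rows.add(new double[] {input, output});
    }

    abstract void train();

    abstract double predict(double input);
}

// Hypothetical "framework-specific" implementation: always predicts the mean output.
final class MeanModel extends SketchModel {
    private double mean;

    @Override
    void train() {
        mean = rows.stream().mapToDouble(r -> r[1]).average().orElse(0.0);
    }

    @Override
    double predict(double input) {
        return mean;
    }
}

// Hypothetical stand-in for the ML class: the entry points only look up a model
// by name and delegate to it, so several models can exist side by side.
final class ModelRegistry {
    private static final Map<String, SketchModel> MODELS = new ConcurrentHashMap<>();

    static void create(String name, SketchModel model) {
        MODELS.put(name, model);
    }

    static SketchModel get(String name) {
        SketchModel model = MODELS.get(name);
        if (model == null) {
            throw new IllegalArgumentException("No model named " + name);
        }
        return model;
    }

    static void remove(String name) {
        MODELS.remove(name);
    }
}
```

Because the registry only delegates, swapping in another framework means writing another subclass; the entry points stay unchanged.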
You have 6 different procedures to work with:
create: Set up a new model and decide which framework to use
add: Add training data to the model
train: Train the model on the added data
predict: Predict data on the trained model
info: Give information about the model (e.g. status)
remove: Delete the model
We need to call the first four procedures in the specified order. If we add new data to an already trained model we need to train it again.
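The ordering rule can be thought of as a small state machine. The sketch below is plain Java with hypothetical names (`LifecycleModel`, `State`), not the project's code. It only illustrates that adding data makes a trained model stale and that predicting is refused until the model has been trained again.

```java
// Hypothetical sketch of the call order: create -> add -> train -> predict.
final class LifecycleModel {
    enum State { CREATED, READY, TRAINED }

    private State state = State.CREATED; // "create" corresponds to the constructor
    private double sum;
    private int count;
    private double trainedMean;

    void add(double value) {
        sum += value;
        count++;
        state = State.READY; // new data makes any previous training stale
    }

    void train() {
        if (state == State.CREATED) {
            throw new IllegalStateException("add data before training");
        }
        trainedMean = sum / count; // placeholder for the real learning step
        state = State.TRAINED;
    }

    double predict() {
        if (state != State.TRAINED) {
            throw new IllegalStateException("train the model before predicting");
        }
        return trainedMean;
    }

    // Rough equivalent of the "info" procedure: report the current status.
    State info() {
        return state;
    }
}
```

Calling `add` after `train` moves the model back to `READY`, so the next `predict` fails until `train` runs again.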
After playing around a bit with the procedures and understanding them it’s time to write our own implementation class.
Applying the user-defined stored procedures
An implementation class needs to provide a constructor and functions for training the model and for predicting with it. I started off by writing a simple implementation for linear regression. Hauke already wrote a great blog post about that topic.
The constructor is invoked when the “create” procedure is called and initializes a new model with the framework-specific parameters. These parameters are used by the train and predict functions.
For the Machine Learning part, I used the library Nd4j, which stands for “n-dimensional arrays for Java” and aims to be a high-performance solution for scientific computing. I also added some extra functionality to the procedures and removed the parts I don’t need. My version can be found on my personal GitHub. Feel free to try it out.
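To show the shape of such an implementation class, the following is a minimal sketch of simple (one-variable) linear regression using the closed-form least-squares solution. It deliberately uses plain Java arrays instead of Nd4j, and the class name `SimpleLinearRegression` and its method signatures are hypothetical. Treat it as an illustration of the technique rather than the actual implementation.

```java
// Hypothetical implementation class: ordinary least squares for
// y = slope * x + intercept, written with plain arrays (no Nd4j).
final class SimpleLinearRegression {
    private double slope;
    private double intercept;

    // Training step: closed-form least-squares fit on paired observations.
    void train(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need at least two (x, y) pairs of equal length");
        }
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < x.length; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= x.length;
        meanY /= y.length;

        double covariance = 0.0;
        double variance = 0.0;
        for (int i = 0; i < x.length; i++) {
            covariance += (x[i] - meanX) * (y[i] - meanY);
            variance += (x[i] - meanX) * (x[i] - meanX);
        }
        if (variance == 0.0) {
            throw new IllegalArgumentException("all x values are identical");
        }
        slope = covariance / variance;
        intercept = meanY - slope * meanX;
    }

    // Prediction step: evaluate the fitted line at a new x.
    double predict(double x) {
        return slope * x + intercept;
    }
}
```

In the monitoring use case, `x` could be a day index and `y` the table count on that day, so that `predict` extrapolates the count for a future day.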
We can now deploy the stored procedures to the database.
The new web-application
“MonitorIT” is my new, revised application. It is based on the Spring Boot framework and uses the Spring Data Neo4j extension (1). This makes it fairly easy to define queries for accessing the data and the Machine Learning stored procedures in the database (2). The queries that call the stored procedures are parameterized for different use cases and execute quickly because they use the BOLT protocol (3). The queries return both existing and predicted data. MonitorIT just needs to visualize this data as a histogram and provide it to the user (4).
Workflow of the new enhanced application (called “MonitorIT”)
Users access the web interface and request different graphs of existing and predicted data.
At first, the application still had one big problem. When a user requested a prediction graph, MonitorIT created, trained and queried a new model for that specific request and deleted it again in a final step. Depending on how much data you add to the model and on the power of the underlying physical machine, the training can take up to several seconds. This slows down the whole process. I therefore chose to create and train the machine learning models periodically with a scheduler. When a user now requests a prediction graph, the web application just calls the “predict” procedure and returns the result almost instantly.
The following images show examples of the finished graphs in the web interface. The first one contains the existing data, while the second one represents the predicted values. Users can limit the data by defining a start and an end date.
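The scheduling idea can be sketched with the JDK's `ScheduledExecutorService`. This is an assumption about one possible approach, not a description of how MonitorIT does it (a Spring Boot application might rely on Spring's own scheduling support instead). `ModelRefresher` and the `Supplier` that stands in for the slow create-and-train step are hypothetical.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.DoubleUnaryOperator;
import java.util.function.Supplier;

// Hypothetical helper: retrains in the background and keeps the latest trained
// model available, so a prediction request never waits for training.
final class ModelRefresher implements AutoCloseable {
    private final AtomicReference<DoubleUnaryOperator> current = new AtomicReference<>();
    private final Supplier<DoubleUnaryOperator> trainer;
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    ModelRefresher(Supplier<DoubleUnaryOperator> trainer) {
        this.trainer = trainer;
        refresh(); // train once up front so predictions work immediately
    }

    // The slow part: build a fresh model, then swap it in atomically.
    void refresh() {
        current.set(trainer.get());
    }

    // Retrain repeatedly at a fixed rate on a background thread.
    void start(long period, TimeUnit unit) {
        scheduler.scheduleAtFixedRate(this::refresh, period, period, unit);
    }

    // The fast part: only evaluates the already trained model.
    double predict(double input) {
        return current.get().applyAsDouble(input);
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }
}
```

A request handler would only ever call `predict`, while `start(10, TimeUnit.MINUTES)` (an arbitrary example period) keeps the model fresh. Note that `scheduleAtFixedRate` suppresses further runs if the task throws, so a real version should catch and log exceptions inside `refresh`.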
Graph for already existing data
Graph for predicted data
The new application fully lives up to the promises of the concept. The data is stored consistently in the database and is never exported, only accessed through the BOLT protocol. Thanks to the scheduler, the predictions take newly added data into account almost instantly. The prediction logic has also moved completely from the application to the database, where the whole Machine Learning process is now defined. And this process now runs automatically.
I can now focus on expanding the functionalities and provide more useful graphs. As a next step, I will work on new implementations that allow me to predict a more complex problem.
Machine Learning has reached a level where it’s fairly easy to create the logic itself. There are many great tutorials that can help you out even if you are new to the subject. The actual difficulty is integrating your Machine Learning process into a bigger system. Often such a process is just a small piece, but if we integrate it badly, it can affect the whole system in a negative way.
It is important to keep this in mind when you start working on a new project with a Machine Learning component. On many occasions, it is worth thinking a bit outside the box in order to find the best solution for your individual use case. Sticking to a certain tutorial can be fast and easy, but you have to go beyond it and experiment until you are able to create something unique that really fits your needs. Along the way, you may also open up new possibilities for yourself and even for other people.
What are your experiences and thoughts about this?