Hauke Brammer
26. July 2017
13 min

Machine Learning Basics - Logistic Regression from Scratch

In this post I will give an introduction to logistic regression, a powerful yet easy-to-implement machine learning method. I will explain some of the mathematical concepts behind it and demonstrate how to implement it.

For the implementation I will use the library ND4J, which also powers the Java machine learning framework Deeplearning4j.

You can put most machine learning methods in two (very) broad categories: methods for regression and methods for classification. In regression you try to predict a numerical value for given inputs and in classification you try to match the inputs to two or more categories.

From linear regression…

Logistic regression (despite its name) is a classification method. With it I can sort different inputs into categories or classes. For instance I can try to tell the exact species of a flower by looking at some of its characteristics like the size or shape of its leaves or petals.
Of course you can use logistic regression for financial, medical or many other types of problems.

But first I will explain how an even simpler regression method works, because it helps with understanding what is going on later. One of the easiest regression methods is linear regression, which you probably already know. Maybe not by its name or in full detail, but the basic concept will probably sound familiar to you:

As an example, let's say I know the number of hours a student studied for an exam and from this I want to predict the exam score:
Student A: Hours studied: 6; Exam score: 60%
Student B: Hours studied: 10; Exam score: 90%
For linear regression I take these two points, fit a line through them and, voilà, I'm done. With this line I can now predict values for a third point.
Naive linear regression with only two points
In more technical terms: My line can be described as a mathematical function (or as it is called in machine learning terms “a model”).
I call my value "hours studied" x and my value "score received" y. So I can describe my function as y = 15 + 7.5 * x or, more generally, as y = a + b * x where a is 15 and b is 7.5.
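In case you wonder where those two numbers come from: the slope b is just the change in score divided by the change in hours studied, and the intercept a is whatever is left over for student A:

b = \frac{90 - 60}{10 - 6} = 7.5 \qquad a = 60 - 7.5 * 6 = 15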

Now I can input different so-called "features" x into my model and out comes a predicted value y.

Later I can refine my model and maybe add another feature x_2. That gives me a modified model y = a + b * x_1 + c * x_2.

I multiply my new feature x_2 with another constant c. If I add more and more features I will eventually run out of letters for the constants. So I rewrite my model with a more generalized notation: y = \theta_0 + \theta_1 * x_1 + \theta_2 * x_2.
Also I add yet another feature x_0 to the model. x_0 is always one, so it does not change anything for now. But it makes the equation look a little bit nicer: y = \theta_0 * x_0 + \theta_1 * x_1 + \theta_2 * x_2. But how do I choose values for my thetas? I will answer that question later in this post. For now I have a regression model that can predict some numerical values.

But my goal is not to predict some numerical values. I want to know if a certain flower is an iris or a daisy. How can I do that? Well, with some modifications I can turn a simple linear regression into logistic regression, which can do exactly that.

… to logistic regression

I will start with a simple objective: I will not ask what kind of flower my example is but if my example is a certain type of flower. Or even more precisely: What is the probability that my example is an iris?
To calculate this probability I first need some feature I can use in my model. Let's pick the length of the flower petals as an example. Let's say my flower has a petal length of two centimeters.
If I use that as input for some basic linear regression model where every constant is just one, I get: y = 1 * 1 + 1 * 2 = 3.

But that is not the probability that this is an iris.
Especially so since probabilities should always be between zero and one (or between 0% and 100%). So whatever I do in my model it needs to put out a number in that range.
There is a nice mathematical function that ensures exactly that. This function is also the reason why logistic regression is named as it is. The function is the “standard logistic function” or the “sigmoid function” and the equation for it looks like this:

y = \sigma(x) = \frac{1}{1 + e^{-x}}

Since this looks a little bit complicated I will explain what the function does.

The part in the lower right of the function is e^{-x} and x is my input into the sigmoid function.
If I draw only the e^{-x} part it looks like this:

Exponential function with e^(-x)

If the input x gets bigger, the output y gets close to zero; if x gets smaller or negative, y gets really big.

Now I add one to it and get the complete bottom part of the function: 1 + e^{-x}. That has the effect that the output y no longer gets close to zero but close to one for big values of x.
To complete the function I take one and divide it by all the stuff I got so far. What will happen? If I divide one by a big number I get a small number that is close to zero. If I divide one by a number that is just a little bit bigger than one I get a number that is just a little bit smaller than one. So the plot of the whole function looks like this:

S-shaped sigmoid function

Turning math into code

To bring the sigmoid function from math to code is quite easy with the help of ND4J, which also powers Deeplearning4j, a framework for machine learning in Java. ND4J's specialty is working with multi-dimensional matrices (n-dimensional arrays). These are represented as INDArray objects.
I will also use an INDArray as the input to my sigmoid function here. That ensures that I can later input any array into my function and the sigmoid value will be calculated for each element of the array.

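A minimal sketch of such a sigmoid function (my own illustration, so the class and method names are arbitrary) could look like this:

import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.ops.transforms.Transforms;

public class LogisticRegression {

    // sigmoid(z) = 1 / (1 + e^(-z)), calculated element-wise for the whole array
    public static INDArray sigmoid(INDArray z) {
        INDArray denominator = Transforms.exp(z.neg()).add(1.0); // 1 + e^(-z)
        return denominator.rdiv(1.0);                            // 1 / (1 + e^(-z))
    }
}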

All operations that I need to do here have very nice convenience wrappers around them, which shortens the code tremendously. One thing to note here is the rdiv method, which is short for "reverse division". This method takes the input number (or array, for that matter) and divides it elementwise by the calling array. For an overview of all methods of the INDArray class take a look at the documentation.

(NOTE: ND4J comes with its own sigmoid function. I created my own for illustration purposes.)

Now we have a nice function that can turn input into a value between zero and one. To calculate a probability I can now take a feature x as input for my model, get an output z from my model, put that as an input into the sigmoid function and get an output probability y between zero and one.

But in fact I do not want to calculate only one output but multiple ones at a time. The sigmoid function can already do that, so it's only logical that I should be able to calculate multiple values for z at a time, too. Let's also say I want to do that for multiple features, e.g. x_1, x_2, x_3, x_4. I can use matrix multiplication to do that.
If I have only one example I can calculate z as follows:

z = \begin{pmatrix} x_{1,0} & x_{1,1} & x_{1,2} & x_{1,3} & x_{1,4} \\ \end{pmatrix} * \begin{pmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \\ \theta_3 \\ \theta_4 \\ \end{pmatrix}

Matrix multiplication animated

Now I can add additional inputs as additional rows in the matrix:

\begin{pmatrix} z_1 \\ z_2 \\ \vdots \\ z_m \\ \end{pmatrix} = \begin{pmatrix} x_{1,0} & x_{1,1} & x_{1,2} & x_{1,3} & x_{1,4} \\ x_{2,0} & x_{2,1} & x_{2,2} & x_{2,3} & x_{2,4} \\ \vdots & & & & \vdots \\ x_{m,0} & x_{m,1} & x_{m,2} & x_{m,3} & x_{m,4} \\ \end{pmatrix} * \begin{pmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \\ \theta_3 \\ \theta_4 \\ \end{pmatrix}

For more information on matrix multiplication you can look at this article on betterexplained.com
If you like the visualizations look at matrixmultiplication.xyz where you can play around with different matrices.

Now I want to bring that into code and also put the result into the sigmoid function. Really easy:

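A sketch of that step, reusing the sigmoid method from above (the name predict is my own choice):

// z = X * theta via matrix multiplication, then squash every z with the sigmoid
public static INDArray predict(INDArray x, INDArray theta) {
    INDArray z = x.mmul(theta); // one z per row (example) of x
    return sigmoid(z);
}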

That is all nice mathematical stuff but it is not actual machine learning yet. I'm currently just calculating outputs from inputs. The actual learning part is still missing.

Bring in the machine learning

For machine learning to work you need some learning material. This material is a so-called training dataset. Here are some example features from a training set. Again, x_0 is always one.

X = \bordermatrix{ &x_0&x_1&x_2&x_3&x_4\cr & 1 & 5.1 & 3.5 & 1.4 & 0.2 \cr & 1 & 4.9 & 3.0 & 1.4 & 0.2 \cr & 1 & 7.7 & 2.8 & 6.7 & 2.0 \cr & 1 & 6.3 & 2.7 & 4.9 & 1.8 \cr } \qquad y = \begin{pmatrix} 1 \\ 1 \\ 0 \\ 0 \\ \end{pmatrix}

The features could for example be the lengths and widths of petals and sepals respectively. Also, for the training set I already know if the flower is an iris or not. So I have my ys already and they are always exactly one or zero.

To start learning I need some initial guess for my parameters theta. I randomly choose all ones for theta 😉

\theta = \begin{pmatrix} 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ \end{pmatrix}

Now I will try to find really good thetas. "Good" in this case means that the thetas will be chosen in such a way that the error between the prediction and the real outcome will be as small as possible. For choosing the thetas I will do multiple rounds of calculation. In each round I will take a look at my parameters theta to see if they produce good ys. Then I will update the thetas for (hopefully) better results.

First I calculate my current guess for all ys. I will call this guess "h". Then I will calculate the difference between my guessed values and the real values.

h = \sigma (X * \theta) = \sigma ( \begin{pmatrix} 11.2 \\ 10.5 \\ 20.2 \\ 16.7 \\ \end{pmatrix} ) \approx \begin{pmatrix} 1 \\ 1 \\ 1 \\ 1 \\ \end{pmatrix}

h - y = \begin{pmatrix} 1 \\ 1 \\ 1 \\ 1 \\ \end{pmatrix} - \begin{pmatrix} 1 \\ 1 \\ 0 \\ 0 \\ \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ 1 \\ 1 \\ \end{pmatrix}

With these differences I will now calculate a value for each of my parameters theta. These values are called "gradients" and will be used to change the thetas later. To put it really, really simply: for each theta the gradient looks at how much "wrongness" the theta puts into the result. If a theta puts much "wrongness" into the result it is corrected a lot. If a theta puts less "wrongness" into the result it is changed less.
To calculate the gradients I multiply the difference between my guessed values and the real values with the transpose of X. Transposing a matrix means switching rows to columns, which is necessary to calculate the gradients correctly. The gradients are divided by the number of examples to normalize the values.

g = X^\top * (h-y) * \frac{1}{m} = \begin{pmatrix} 1 & 1 & 1 & 1\\ 5.1 & 4.9 & 7.7 & 6.3\\ 3.5 & 3.0 & 2.8 & 2.7\\ 1.4 & 1.4 & 6.7 & 4.9\\ 0.2 & 0.2 & 2.0 & 1.8\\ \end{pmatrix} * \begin{pmatrix} 0 \\ 0 \\ 1 \\ 1 \\ \end{pmatrix} * \frac{1}{4} = \begin{pmatrix} 2 \\ 14 \\ 5.5 \\ 11.6 \\ 3.8 \\ \end{pmatrix} * \frac{1}{4} = \begin{pmatrix} 0.5 \\ 3.5 \\ 1.375 \\ 2.9 \\ 0.95 \\ \end{pmatrix}

Note: Actually the gradients are gradients of the cost function of our model and are used in the gradient descent optimization process. If you are interested in a really good introduction to this and many more topics I recommend Andrew Ng's machine learning course on Coursera.

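In code the gradient calculation is essentially a one-liner around the formula above. A sketch, again using the hypothetical predict helper from before:

// g = X^T * (h - y) / m, where m is the number of training examples
public static INDArray gradient(INDArray x, INDArray y, INDArray theta) {
    int m = x.rows();               // number of examples
    INDArray h = predict(x, theta); // current guesses
    return x.transpose().mmul(h.sub(y)).div(m);
}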

With this gradient calculation done I can start the learning (or training) part of machine learning.


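The training loop could then be sketched roughly like this (method and parameter names are again my own; it additionally needs import org.nd4j.linalg.factory.Nd4j):

public static INDArray train(INDArray x, INDArray y, double alpha, double epsilon, int maxIterations) {
    INDArray theta = Nd4j.rand(x.columns(), 1); // one theta per feature column, randomly initialised
    for (int i = 0; i < maxIterations; i++) {
        // move theta a small step against the gradient
        INDArray newTheta = theta.sub(gradient(x, y, theta).mul(alpha));
        // stop as soon as theta no longer changes in a meaningful way
        double change = Transforms.abs(newTheta.sub(theta)).meanNumber().doubleValue();
        theta = newTheta;
        if (change < epsilon) {
            break;
        }
    }
    return theta;
}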
First I create some random thetas. For each of my features x that are stored in the INDArray X I need one theta. With these thetas I can now run a loop as often as I like, and in each iteration of the loop I calculate new gradients. The learning rate alpha controls how aggressively the new values of theta are calculated.
I introduce a termination criterion epsilon and test if the difference between the old and the new values of theta is smaller than this criterion. If that is the case I can stop the loop, since theta will no longer change in a meaningful way.

Time for some data

Now it’s finally time to bring in some real data. For this post I’m going to use the well known “iris dataset”. This dataset consists of 150 examples of flowers of three species: Iris Setosa, Iris Versicolour and Iris Virginica.

These are the classes I’m going to predict from the given features. Each class has a class number which is an integer from 1 to 3.

Each example has 4 features:
1. sepal length in cm
2. sepal width in cm
3. petal length in cm
4. petal width in cm

These are plots of combinations of these features.
Plots with the features of the iris dataset

To work with this dataset I first need to get the data out of the csv file and into a structure that is compatible with ND4J's INDArrays. In this case I choose the DataSet class, which consists of two INDArrays: one for the features and one for the class labels. This makes handling the data much easier.
To convert the csv file to a list of DataSet objects I first create a buffered reader to read the raw csv file and then use Java's Stream API for the mapping.

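A sketch of that conversion, assuming a csv file whose first four columns are the features and whose last column already holds the numeric class label (1 to 3):

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import org.nd4j.linalg.dataset.DataSet;
import org.nd4j.linalg.factory.Nd4j;

public static List<DataSet> loadIrisData(Path csvFile) throws IOException {
    try (BufferedReader reader = Files.newBufferedReader(csvFile)) {
        return reader.lines()
                .map(line -> line.split(","))                          // split every csv line
                .map(tokens -> Arrays.stream(tokens)
                        .mapToDouble(Double::parseDouble)
                        .toArray())                                    // parse the values to doubles
                .map(values -> new DataSet(
                        Nd4j.create(Arrays.copyOfRange(values, 0, 4)), // the four features
                        Nd4j.create(new double[]{values[4]})))         // the class label
                .collect(Collectors.toList());
    }
}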

First I split the lines and parse them to doubles. In the second step I convert the double arrays to INDArrays, split them into features and classes and build datasets from them.

The dataset consists of three classes, but everything I explained before works only with one class, right? So how does all of it work with multiple classes? Well, there is a little trick: I will build one model for each class. Each model can only tell me the probability that the given flower belongs to the model's class. Then I will pick the class where the probability is highest.

In my current dataset the class labels are numbers between one and three. To get nice y probability values that are between zero and one I need a little helper function that converts these arrays of class labels to arrays with probabilities.

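Such a helper could be sketched like this (the name labelsToY is my own):

// turn a column of class labels into a column of ones and zeros for one class
public static INDArray labelsToY(INDArray labels, double classLabel) {
    INDArray y = labels.dup(); // work on a copy so the original labels stay untouched
    for (int i = 0; i < y.rows(); i++) {
        if (y.getDouble(i) != classLabel) {
            y.putScalar(i, 0.0); // everything that does not match the label becomes zero
        } else {
            y.putScalar(i, 1.0); // every matching label becomes one
        }
    }
    return y;
}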

First I set all values in the array that do not match my label (for instance “2”) to zero. Then I set all matching numbers to one. After these two transformations I have converted the label array to an array that consists of only ones and zeros.

Also I need a little function that counts correctly classified examples.

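A possible sketch, assuming the predictions come in as one probability column per class and the labels as class numbers from 1 to 3:

// count the examples where the class with the highest probability matches the true label
public static long countCorrect(INDArray predictions, INDArray labels) {
    long correct = 0;
    for (int i = 0; i < labels.rows(); i++) {
        int bestClass = 0;
        for (int c = 1; c < predictions.columns(); c++) {
            if (predictions.getDouble(i, c) > predictions.getDouble(i, bestClass)) {
                bestClass = c;
            }
        }
        if (bestClass + 1 == (int) labels.getDouble(i)) { // class numbers start at one
            correct++;
        }
    }
    return correct;
}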

Now I can finally bring all parts together:

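The following sketch wires the pieces from above together. It deviates from the original post in one detail: instead of a Deeplearning4j DataSetIterator it merges the loaded list into a single DataSet and splits it with splitTestAndTrain; the file name iris.csv and all helper names are the assumptions from the earlier sketches (it additionally needs import java.nio.file.Paths and import org.nd4j.linalg.dataset.SplitTestAndTrain):

public static void main(String[] args) throws IOException {
    List<DataSet> irisData = loadIrisData(Paths.get("iris.csv")); // example file name
    DataSet allData = DataSet.merge(irisData);
    allData.shuffle();

    // 120 examples for training, 30 for testing
    SplitTestAndTrain split = allData.splitTestAndTrain(120);
    DataSet trainSet = split.getTrain();
    DataSet testSet = split.getTest();

    // add the constant x0 = 1 feature as an extra column of ones
    INDArray trainX = Nd4j.hstack(Nd4j.ones(trainSet.numExamples(), 1), trainSet.getFeatures());
    INDArray testX = Nd4j.hstack(Nd4j.ones(testSet.numExamples(), 1), testSet.getFeatures());

    double alpha = 0.01;     // learning rate
    double epsilon = 0.0001; // termination criterion
    int maxIterations = 10000;

    // one model per class: each answers "is this flower of class c or not?"
    INDArray[] thetas = new INDArray[3];
    for (int c = 0; c < 3; c++) {
        INDArray y = labelsToY(trainSet.getLabels(), c + 1); // class labels are 1 to 3
        thetas[c] = train(trainX, y, alpha, epsilon, maxIterations);
    }

    // one probability column per class, then pick the most probable class per example
    INDArray predictions = Nd4j.hstack(
            predict(testX, thetas[0]),
            predict(testX, thetas[1]),
            predict(testX, thetas[2]));
    long correct = countCorrect(predictions, testSet.getLabels());
    System.out.println("Accuracy: " + (double) correct / testSet.numExamples());
}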

First I load the dataset. For convenience I use the Deeplearning4j DataSetIterator class. Now I can easily split my data into 120 examples for training and 30 examples for testing. The testing data is used at the end to check if my models can really classify the flowers into their correct classes.
Currently the standard x0 feature, which is always one, is missing from the feature array. I add this feature with the hstack ("horizontal stack") function. After that the three y arrays are created from the labels. Then the training begins. I choose a learning rate of 0.01 and will train for a maximum of 10000 iterations. If the mean difference between the old and the new theta becomes smaller than 0.0001, the learning will also stop.

In the final lines I test the learned models with the test dataset. For that I add a column of ones to the features and start the prediction. Then I count the correctly classified examples and calculate the accuracy.

In my test run the three models trained for 2896, 3545 and 9090 iterations respectively. Out of the 30 test examples 29 were predicted correctly, which results in an accuracy of about 96.66%. If you play around with the different parameters (learning rate alpha, number of maximum iterations and termination criterion epsilon) maybe you can find a combination that results in 100% accuracy. There are also other ways to improve the prediction: You could introduce new artificial features like ratios, sums or products of other features. Also I must note that this dataset is really, really tiny for machine learning purposes.

Wrap up

In this post I tried to explain the concepts and a little bit of the math behind logistic regression and how you can implement it from scratch with ND4J and Deeplearning4j. Most machine learning frameworks and toolboxes already provide functions like linear regression and logistic regression, but implementing the base algorithms ourselves helps us gain a deeper understanding of how the machine learning methods work. I also have to note that I skipped over some parts and concepts to simplify things a little bit. If you are interested in an in-depth introduction to machine learning and the math behind it I can only recommend (again) Andrew Ng's machine learning course on Coursera.
