# Machine Learning Basics - Logistic Regression from Scratch

For the implementation I will use the ND4J library, which also powers the Java machine learning framework Deeplearning4j.

You can put most machine learning methods into two (very) broad categories: methods for regression and methods for classification. In regression you try to predict a numerical value for given inputs; in classification you try to match the inputs to two or more categories.

## From linear regression…

Logistic regression (despite its name) is a classification method. With it I can sort different inputs into categories or classes. For instance, I can try to tell the exact species of a flower by looking at some of its characteristics, like the size or shape of its leaves or petals.

Of course you can use logistic regression for financial, medical or many other types of problems.

But first I will explain how an even simpler regression method works, because it helps in understanding what happens later. One of the easiest regression methods is linear regression, which you probably already know. Maybe not by its name or in full detail, but the basic concept will probably sound familiar to you:

As an example, let's say I have the number of hours two students studied for an exam, and from these I want to predict exam scores:

Student A: Hours studied: 6; Exam score: 60%

Student B: Hours studied: 10; Exam score: 90%

For linear regression I take these two points, fit a line through them and, voilà, I'm done. With this line I can now predict values for new points.

In more technical terms: My line can be described as a mathematical function (or as it is called in machine learning terms “a model”).

I call my value "hours studied" *x* and my value "score received" *y*. So I can describe my function as *y = 15 + 7.5·x*, or more generally as *y = a + b·x*, where *a* is 15 and *b* is 7.5.

Now I can input different so-called "features" *x* into my model and out comes a predicted value *y*.
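To make this concrete, here is a plain-Java sketch that fits the line through students A and B and predicts a score for a hypothetical third student. The class and method names are my own and are not part of the code shown later in this post.

```java
public class LinePredictor {
    // line fitted through (6, 60) and (10, 90): y = a + b * x
    static final double B = (90.0 - 60.0) / (10.0 - 6.0); // slope b = 7.5
    static final double A = 60.0 - B * 6.0;               // intercept a = 15.0

    // predict an exam score from hours studied
    static double predict(double hoursStudied) {
        return A + B * hoursStudied;
    }

    public static void main(String[] args) {
        System.out.println(predict(8)); // prints 75.0
    }
}
```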

Later I can refine my model and maybe add another feature *x₂*. That gives me a modified model *y = a + b·x₁ + c·x₂*: I multiply my new feature *x₂* with another constant *c*. If I add more and more features I will eventually run out of letters for the constants. So I rewrite my model with a more generalized notation: *y = θ₁·x₁ + θ₂·x₂ + … + θₙ·xₙ + θ₀*.

Also I add yet another feature *x₀* to the model. *x₀* is always one, so it does not change anything for now, but it makes the equation look a little bit nicer: *y = θ₀·x₀ + θ₁·x₁ + … + θₙ·xₙ*. But how do I choose values for my *theta*s? I will answer that question later in the post. For now I have a regression model that can predict some numerical values.

But my goal is not to predict some numerical values. I want to know if a certain flower is an iris or a daisy. How can I do that? Well, with some modification I can turn a simple linear regression into logistic regression, which can do exactly that.

## … to logistic regression

I will start with a simple objective: I will not ask what kind of flower my example is, but whether my example is a certain type of flower. Or even more precisely: what is the probability that my example is an iris?

To calculate this probability I first need some feature I can use in my model. Let's pick the length of the flower petals as an example, and let's say my flower has a petal length of two centimeters.

If I use that as input for some basic linear regression model where every constant is just one, I get: *z = θ₀·x₀ + θ₁·x₁ = 1·1 + 1·2 = 3*.

But that is not the probability that this is an iris. Probabilities should always be between zero and one (or between 0% and 100%), so whatever my model does, it needs to output a number in that range.

There is a nice mathematical function that ensures exactly that. This function is also the reason why logistic regression is named as it is. It is the "standard logistic function" or "sigmoid function", and its equation looks like this: *σ(x) = 1 / (1 + e^(−x))*.

Since this looks a little bit complicated I will explain what the function does.

The part in the lower right of the function is *e^(−x)*, where *x* is my input into the sigmoid function.

If I look at the *e^(−x)* part on its own: if the input *x* gets bigger, the output *y* gets close to zero; if *x* gets smaller or negative, *y* gets really big.

Now I add one to it and get the complete bottom part of the function: *1 + e^(−x)*. That has the effect that the output *y* no longer gets close to zero but close to one for big values of *x*.

To complete the function I take one and divide it by all the stuff I got so far. What will happen? If I divide one by a big number, I get a number close to zero. If I divide one by a number that is just a little bit bigger than one, I get a number that is just a little bit smaller than one. So the whole function rises smoothly from zero (for very negative inputs) to one (for big inputs).
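This behavior can be checked with a tiny scalar version of the function in plain Java; this is just an illustration, separate from the array-based ND4J implementation in the next section.

```java
public class SigmoidSketch {
    // standard logistic function: 1 / (1 + e^(-x))
    static double sigmoid(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }

    public static void main(String[] args) {
        System.out.println(sigmoid(0));   // prints 0.5, exactly in the middle
        System.out.println(sigmoid(10));  // very close to one for big inputs
        System.out.println(sigmoid(-10)); // very close to zero for very negative inputs
    }
}
```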

## Turning math into code

Bringing the sigmoid function from math to code is quite easy with the help of ND4J, which also powers Deeplearning4j, a framework for machine learning in Java. ND4J's specialty is working with multi-dimensional matrices (n-dimensional arrays). These are represented as INDArray objects.

I will also use an INDArray as input to my sigmoid function here. That ensures that I can later input any array into my function and the sigmoid value will be calculated for each element in the array.

```java
private static INDArray sigmoid(INDArray Z) {
    //Z = (-Z)
    Z = Z.mul(-1.0);
    //Z = e^(Z); do not create new array
    Z = exp(Z, false);
    //1 + Z
    Z = Z.add(1.0);
    //1.0 / Z
    Z = Z.rdiv(1.0);
    return Z;
}
```

All operations that I need here have very nice convenience wrappers around them, which shortens the code tremendously. One thing to note is the `rdiv` method, which is short for "reverse division": it takes the input number (or array, for that matter) and divides it elementwise by the calling array. For an overview of all methods of the INDArray class take a look at the documentation.

(NOTE: ND4J comes with its own sigmoid function. I created my own for illustration purposes.)

Now we have a nice function that can turn input into a value between zero and one. To calculate a probability I can now take a feature *x* as input for my model, get an output *z* from my model, put that into the sigmoid function and get an output probability *y* between zero and one.

But in fact I do not want to calculate only one output but multiple ones at a time. The sigmoid function can already do that, so it's only logical that I can also calculate multiple values for *z* at a time. Let's also say I want to do that for multiple features, e.g. *x₁*, *x₂*, *x₃*, *x₄*. I can use matrix multiplication to do that.

If I have only one example I can calculate *z* as follows: *z = x₀·θ₀ + x₁·θ₁ + x₂·θ₂ + x₃·θ₃ + x₄·θ₄*, the product of a row of features and a column of *theta*s.

Now I can add additional inputs as additional rows in the matrix *X*, which gives *z = X·θ* with one output per row.

For more information on matrix multiplication you can look at this article on betterexplained.com

If you like visualizations, take a look at matrixmultiplication.xyz, where you can play around with different matrices.
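The row-times-column arithmetic can also be sketched in plain Java; this toy version is only meant to illustrate the math, not to replace the ND4J implementation that follows.

```java
public class MatrixSketch {
    // z = X * theta: each row of X is one example, theta holds the constants
    static double[] multiply(double[][] X, double[] theta) {
        double[] z = new double[X.length];
        for (int i = 0; i < X.length; i++) {
            for (int j = 0; j < theta.length; j++) {
                z[i] += X[i][j] * theta[j];
            }
        }
        return z;
    }

    public static void main(String[] args) {
        // two examples, each with x0 = 1 and two more features
        double[][] X = {{1, 2, 3}, {1, 4, 5}};
        double[] theta = {1, 1, 1};
        System.out.println(java.util.Arrays.toString(multiply(X, theta))); // prints [6.0, 10.0]
    }
}
```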

Now I want to bring that into code and also put the result into the sigmoid function. Really easy:

```java
private static INDArray calculateOutput(INDArray X, INDArray theta) {
    INDArray z = X.mmul(theta);
    return sigmoid(z);
}
```

That is all nice mathematical stuff, but it is not actual machine learning yet. I'm currently just calculating outputs from inputs. The actual learning part is still missing.

## Bring in the machine learning

For machine learning to work you need some learning material. This material is a so-called training dataset. Here are some examples of features from a training set. Again, *x₀* is always one.

The features could, for example, be the lengths and widths of petals and leaves respectively. For the training set I also already know whether the flower is an iris or not. So I have my *y*s already, and they are always exactly one or zero.

To start learning I need some initial guess for my parameters theta. I randomly choose all ones for theta 😉

Now I will try to find really good *theta*s. "Good" in this case means that the *theta*s are chosen in such a way that the error between the prediction and the real outcome is as small as possible. For choosing the *theta*s I will do multiple rounds of calculation. In each round I take a look at my parameters *theta* to see if they produce good *y*s. Then I update the *theta*s for (hopefully) better results.

First I calculate my current guess for all *y*s. I will call this guess "*h*". Then I calculate the difference between the real values and my guessed values.

With these differences I will now calculate a value for each of my parameters *theta*. These values are called "gradients" and will be used to change the *theta*s later. To put it really, really simply: for each *theta*, the gradient looks at how much "wrongness" that *theta* puts into the result. If a *theta* puts much "wrongness" into the result, it is corrected a lot. If a *theta* puts in less "wrongness", it is changed less.

To calculate the gradients I multiply the difference between the real values and the guessed values with the transpose of *X*. Transposing a matrix means switching rows and columns, which is necessary for the matrix multiplication to line up correctly. The gradients are divided by the number of examples to normalize the values.

Note: Strictly speaking, the gradients are gradients of the cost function of our model and are used in the gradient descent optimization process. If you are interested in a really good introduction to this and many more topics, I recommend Andrew Ng's machine learning course on Coursera.

```java
private static INDArray gradientFunction(INDArray theta, INDArray X, INDArray y) {
    //number of samples
    int m = X.size(0);

    INDArray h = calculateOutput(X, theta);
    //difference between predicted and actual class
    INDArray diff = h.dup().sub(y);

    return X.dup()
            .transpose()
            .mmul(diff)
            .mul(1.0 / (double) m);
}
```
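To see the formula in action outside of ND4J, here is a plain-Java sketch of the same gradient computation on tiny hand-made arrays; the numbers are made up purely for illustration.

```java
public class GradientSketch {
    // gradients[j] = (1/m) * sum over i of X[i][j] * (h[i] - y[i])
    static double[] gradients(double[][] X, double[] h, double[] y) {
        int m = X.length;    // number of examples
        int n = X[0].length; // number of thetas
        double[] grad = new double[n];
        for (int i = 0; i < m; i++) {
            double diff = h[i] - y[i]; // prediction error for example i
            for (int j = 0; j < n; j++) {
                grad[j] += X[i][j] * diff / m;
            }
        }
        return grad;
    }

    public static void main(String[] args) {
        double[][] X = {{1, 2}, {1, 4}}; // two examples: x0 = 1 plus one feature
        double[] h = {0.9, 0.2};         // guessed probabilities
        double[] y = {1.0, 0.0};         // real classes
        // gradients work out to roughly [0.05, 0.3]
        System.out.println(java.util.Arrays.toString(gradients(X, h, y)));
    }
}
```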

With this gradient calculation done I can start the learning (or training) part of machine learning.

```java
private static INDArray training(double alpha, INDArray X, INDArray y, int maxIterations, double epsilon) {
    //set random seed
    Nd4j.getRandom().setSeed(1234);

    //initialize theta randomly
    INDArray theta = Nd4j.rand(X.size(1), 1);
    INDArray newTheta = theta.dup();

    for (int i = 0; i < maxIterations; i++) {
        INDArray gradients = gradientFunction(theta, X, y);

        //calculate new theta with gradients and learning rate alpha
        gradients = gradients.mul(alpha);
        newTheta = theta.sub(gradients);

        if (hasConverged(theta, newTheta, epsilon)) {
            System.out.println("Done");
            break;
        }
        theta = newTheta;
    }
    return newTheta;
}

private static boolean hasConverged(INDArray oldTheta, INDArray newTheta, double epsilon) {
    double diffSum = abs(oldTheta.sub(newTheta)).sumNumber().doubleValue();
    return diffSum / (double) oldTheta.size(0) < epsilon;
}
```

First I create some random thetas. For each of my features *x* stored in the INDArray X I need one theta. With these thetas I can now run a loop as often as I like, and in each iteration of the loop I calculate new gradients. The learning rate alpha controls how aggressively the new values of theta are calculated.

I introduce a termination criterion epsilon and test whether the difference between the old and new values of theta is smaller than this criterion. If that is the case I can stop the loop, since theta will no longer change in a meaningful way.

## Time for some data

Now it’s finally time to bring in some real data. For this post I’m going to use the well-known “iris dataset”. This dataset consists of 150 examples of flowers of three species: Iris Setosa, Iris Versicolour and Iris Virginica.

These are the classes I’m going to predict from the given features. Each class has a class number which is an integer from 1 to 3.

Each example has 4 features:

1. sepal length in cm

2. sepal width in cm

3. petal length in cm

4. petal width in cm

These are plots of combinations of these features.

To work with this dataset I first need to get the data out of the csv file and into a structure that is compatible with ND4J’s INDArrays. In this case I choose the DataSet class, which consists of two INDArrays: one for features and one for class labels. This makes handling the data much easier.

To convert the csv file to a list of DataSets I first create a buffered reader to read the raw csv file and then use Java’s Stream API for the mapping.

```java
private static List<DataSet> getIrisData() {
    String csvFile = "/iris.data.int.csv";
    BufferedReader reader = null;
    List<DataSet> irisData = new ArrayList<>();
    try {
        reader = new BufferedReader(new FileReader((new ClassPathResource(csvFile)).getFile()));
        irisData = reader.lines()
                .map(mapCSVRowToDataset)
                .collect(Collectors.toList());
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        if (reader != null) {
            try {
                reader.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
    return irisData;
}

private static Function<String, DataSet> mapCSVRowToDataset = (String line) -> {
    //SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
    //5.1,3.5,1.4,0.2,1
    double[] parsedRow = Arrays.stream(line.split(","))
            .mapToDouble(Double::parseDouble)
            .toArray();

    //get number of columns
    int columns = parsedRow.length;

    //create dataset with features and labels from the csv row
    return new DataSet(
            Nd4j.create(Arrays.copyOfRange(parsedRow, 0, columns - 1)),
            Nd4j.create(Arrays.copyOfRange(parsedRow, columns - 1, columns))
    );
};
```

First I split the lines and parse them to doubles. In the second step I convert the double arrays to INDArrays, split them into features and classes, and build DataSets from them.

The dataset consists of three classes, but everything I explained before works only with one class, right? So how does all of it work with multiple classes? Well, there is a little trick: I will build a model for each class. Each model can only tell me the probability that the given flower belongs to that model’s class. Then I pick the class with the highest probability.
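Picking the class with the highest probability is the heart of the `predictLabels` helper used in the main method later on. A plain-Java sketch of that argmax step could look like this (class and method names are my own, ND4J-free):

```java
public class OneVsAllSketch {
    // pick the class whose model reports the highest probability
    static int predictClass(double[] probabilities) {
        int best = 0;
        for (int c = 1; c < probabilities.length; c++) {
            if (probabilities[c] > probabilities[best]) {
                best = c;
            }
        }
        return best + 1; // class labels in the dataset run from 1 to 3
    }

    public static void main(String[] args) {
        // probabilities from the three per-class models for one flower
        double[] probs = {0.12, 0.81, 0.33};
        System.out.println(predictClass(probs)); // prints 2
    }
}
```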

In my current dataset the class labels are numbers between one and three. To get nice *y* probability values between zero and one, I need a little helper function that converts these arrays of class labels to arrays of ones and zeros.

```java
private static INDArray filterClassLabels(INDArray labels, Number label) {
    //returns an array with zeros where labels[i] != label and ones where labels[i] == label
    INDArray classn = labels.dup();

    //replace labels[i] != label with zeros
    classn = Nd4j.getExecutioner().execAndReturn(new CompareAndSet(
            classn, 0.0, new NotEqualsCondition(label)
    ));

    //replace labels[i] == label with ones
    classn = Nd4j.getExecutioner().execAndReturn(new CompareAndSet(
            classn, 1.0, new EqualsCondition(label)
    ));
    return classn;
}
```

First I set all values in the array that do not match my label (for instance “2”) to zero. Then I set all matching values to one. After these two transformations I have converted the label array into an array that consists of only ones and zeros.
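Without ND4J’s condition classes, the same transformation can be sketched in plain Java:

```java
public class LabelFilterSketch {
    // one-vs-rest labels: 1.0 where the label matches, 0.0 everywhere else
    static double[] filterLabels(double[] labels, double label) {
        double[] result = new double[labels.length];
        for (int i = 0; i < labels.length; i++) {
            result[i] = (labels[i] == label) ? 1.0 : 0.0;
        }
        return result;
    }

    public static void main(String[] args) {
        double[] labels = {1, 2, 3, 2, 1};
        // keep class 2: prints [0.0, 1.0, 0.0, 1.0, 0.0]
        System.out.println(java.util.Arrays.toString(filterLabels(labels, 2.0)));
    }
}
```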

Also I need a little function that counts correctly classified examples.

```java
private static double countCorrectSamples(INDArray labels, INDArray predictedLabels, INDArray features) {
    int correctSamples = 0;
    for (int i = 0; i < labels.size(0); i++) {
        if (labels.getDouble(new int[]{i}) == predictedLabels.getDouble(new int[]{i})) {
            correctSamples++;
        }
    }
    return (double) correctSamples;
}
```

Now I can finally bring all parts together:

```java
public static void main(String[] args) {
    List<DataSet> data = getIrisData();
    DataSetIterator iter = new IteratorDataSetIterator(data.iterator(), 150);
    DataSet next = iter.next();

    //80% training data, 20% test data
    Random r = new Random(987654321);
    SplitTestAndTrain testAndTrain = next.splitTestAndTrain(120, r);
    DataSet train = testAndTrain.getTrain();
    DataSet test = testAndTrain.getTest();

    //prepend constant ones column
    INDArray features = Nd4j.hstack(
            Nd4j.ones(train.getFeatures().size(0), 1),
            train.getFeatures()
    );
    INDArray labels = train.getLabels();

    System.out.println("Train");
    INDArray class1 = filterClassLabels(labels, 1.0);
    INDArray class2 = filterClassLabels(labels, 2.0);
    INDArray class3 = filterClassLabels(labels, 3.0);

    INDArray thetaClass1 = training(0.01, features, class1, 10000, 0.0001);
    INDArray thetaClass2 = training(0.01, features, class2, 10000, 0.0001);
    INDArray thetaClass3 = training(0.01, features, class3, 10000, 0.0001);

    System.out.println("Theta Class 1" + thetaClass1.toString());
    System.out.println("Theta Class 2" + thetaClass2.toString());
    System.out.println("Theta Class 3" + thetaClass3.toString());

    //prepend constant ones column
    INDArray testFeatures = Nd4j.hstack(
            Nd4j.ones(test.getFeatures().size(0), 1),
            test.getFeatures()
    );

    System.out.println("Test");
    INDArray predictedLabels = predictLabels(testFeatures, thetaClass1, thetaClass2, thetaClass3);
    INDArray testLabels = test.getLabels();

    double correctSamples = countCorrectSamples(testLabels, predictedLabels, testFeatures);
    double accuracy = correctSamples / (double) testLabels.size(0);
    System.out.println("Correct samples: " + correctSamples);
    System.out.println("Accuracy: " + accuracy * 100 + "%");
}
```

First I load the dataset. For convenience I use the Deeplearning4j DataSetIterator class. Now I can easily split my data into 120 examples for training and 30 examples for testing. The testing data is used at the end to check whether my models can really classify the flowers into their correct classes.

Currently the standard *x₀* feature, which is always one, is missing from the feature array. I add this feature with the hstack (“horizontal stack”) function. After that, the three *y* arrays are created from the labels. Then the training begins: I choose a learning rate of 0.01 and will train for a maximum of 10000 iterations. If the mean difference between the old and the new *theta*s becomes smaller than 0.0001, the learning also stops.

In the final lines I test the learned models with the test dataset. To do that I add a column of ones to the test features and start the prediction. Then I count the correct samples and calculate the accuracy.

In my test run the three models trained for 2896, 3545 and 9090 iterations respectively. Out of the 30 test examples, 29 were predicted correctly, which results in an accuracy of about 96.66%. If you play around with the different parameters (learning rate alpha, number of maximum iterations and termination criterion epsilon), maybe you can find a combination that results in 100% accuracy. There are also other ways to improve the prediction: you could introduce new artificial features like ratios, sums or products of other features. I must also note that this dataset is really, really tiny for machine learning purposes.

## Wrap up

In this post I tried to explain the concepts and a little bit of the math behind logistic regression, and how you can implement it from scratch with ND4J and Deeplearning4j. Most machine learning frameworks and toolboxes already provide functions like linear regression and logistic regression, but implementing the base algorithms ourselves helps us gain a deeper understanding of how machine learning methods work. I also have to note that I skipped over some parts and concepts to simplify things a little bit. If you are interested in an in-depth introduction to machine learning and the math behind it, I can (again) only recommend Andrew Ng’s machine learning course on Coursera.
