Dive into the World of Data Mining! Part 2: Building a model
A little bit of history…
In 1953 the federal court in Germany established a drink-drive limit of 1.5 per mille. This was the first time a driver's alcohol consumption was regulated by law. On 14 June 1973, the German Bundestag reduced this limit to 0.8 per mille, and from 01.04.2001 the limit was lowered again to 0.5 per mille. Was this ruling correct and efficient? And if so, how effective was it?
In Part 1: Introduction to RapidMiner we became acquainted with the functions and capabilities of RapidMiner. In this part we will build a time-series forecasting model based on a neural network with RapidMiner.
Building the first model
Unfortunately, we have no data from 1953 and 1973 to analyze the effectiveness of the first decision. But we can analyze the number of traffic accidents caused by alcohol before and after 01.04.2001, and then predict how many traffic accidents will happen in the future. This information may be especially useful for the police or other legislative institutions.
First of all, import the data with the “Retrieve” operator. Since we only want to analyze the cases with alcohol involvement, we should filter out the rows and attributes which are not necessary for our analysis.
With a filter operator [Blending-> Examples-> Filter Examples] we define that only alcohol-related accidents are relevant.
Fig. 7 Data filtering
Fig. 8 Filter parameters
In the next step let’s select attributes with the “Select attributes” operator [Blending->Attributes-> Selection-> Select Attributes].
Fig. 9 Attribute selection
By the way, here is a small tip: if you press the green arrow next to a parameter, you can see the typical choices other RapidMiner users make for some configurations. This is useful if you are not sure what to choose.
Fig. 10 Users selection statistic
If you choose the attribute filter type “subset”, you can select more than one attribute for your model. For our case we select the attributes MONTH (month & year), WERT (number of traffic accidents or of injured and killed people) and AUSPRÄGUNG (identification: injured and killed, or overall).
Fig. 11 Relevant attributes selection
If you take a look at the statistics view, you can see a scatter chart of our three attributes.
Fig. 12 Scatter chart alcohol caused traffic accidents from 2000 to 2017
If you zoom into the area between 2000 and 2002, you will see that the number of traffic accidents in 2001 and 2002 is significantly lower than in 2000. So we can say that lowering the drink-drive limit was effective and beneficial.
Fig. 13 Zoomed scatter chart alcohol caused traffic accidents from 2000 to 2002
Time series prediction model
In this part we predict the number of road accidents due to alcohol consumption.
First of all, we must separate our data into two parts: training and test. We train our model by pairing the input with an expected output. It is clear that our output will be a forecast of the number of traffic accidents. But what do we use as input?
Let’s assume that there are some dependencies between the number of traffic accidents last month and this month. So we generate three new columns with the values from the last 3 months. You may have noticed that the date column is sorted in descending order – from 12.2017 to 01.2000. To predict the future (and not the past, because that would be silly 🙂 ) we should sort our date column with the Sort operator [Blending-> Examples-> Sort] and set the sorting direction to ascending.
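The idea of sorting ascending and then generating lag columns can be sketched in a few lines of pandas. The column names and values below are illustrative stand-ins, not the actual dataset:

```python
import pandas as pd

# Hypothetical monthly accident counts (column names and values assumed)
df = pd.DataFrame({
    "MONTH": ["2000-01", "2000-02", "2000-03", "2000-04", "2000-05", "2000-06"],
    "WERT":  [120, 115, 130, 125, 118, 122],
})

# Sort ascending by date so that lag columns point to the past, not the future
df = df.sort_values("MONTH").reset_index(drop=True)

# Create three new columns with the values from the last 3 months
for lag in (1, 2, 3):
    df[f"WERT-{lag}"] = df["WERT"].shift(lag)

print(df)
```

Note that the first three rows contain missing values, because there is no earlier history for them yet.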
Now let’s filter our examples and select the attributes as in our first model – only alcohol cases and three attributes are relevant: MONTH (month & year), WERT (number of traffic accidents or of injured and killed people) and AUSPRÄGUNG (identification: injured and killed, or overall).
The “Windowing” operator transforms the time series data into a generic data set. Figure 14 illustrates the idea of windowing. The “window size” value creates a set of new attributes with names ranging from WERT-0 to WERT-3. The “step size” value defines how many values to step forward to start each new window.
Fig. 14 Time series windowing
Time series data has a date column. RapidMiner must be informed that one of the columns in the data set is a date and should be defined as “id”. We can accomplish that with the “Set Role” operator.
Fig. 15 Id definition in set role operator
Fig.16 Test and training parts of data set and first steps of modeling
The parameter settings for the “Windowing” operator to achieve this are shown in the following screenshot:
Fig. 17 Windowing parameters
Let’s take a closer look at those parameters and figure out what they mean.
“Series representation” defaults to “encode_series_by_examples”. This means our time series data is stored as many rows; the windowing looks down along the rows and creates new horizontal attribute values, i.e. each example encodes the value for a new time point.
If you have many columns with time series data and just one or a couple of rows, you should choose “encode_series_by_attributes”. The “Windowing” operator will look horizontally along the attributes of the set and each value encodes the value for a new time point. If there is more than one example, the windowing is performed for each example independently and all resulting window examples are merged into a complete example set. 
Nice to know: “encode_series_by_examples” usually performs better and is more efficient in terms of memory usage.
“Window size” defines the number of months RapidMiner will use to predict the future value. If we set it to 3, the algorithm will use 3 months of data to predict the future value.
“Step size” determines which values to skip or step over, i.e. the distance between window starts. That means if your step size is 4, RapidMiner will use the 1st, 5th, 9th, etc. values.
“Horizon” defines how far into the future we forecast, i.e. the distance between the last window value and the value to predict. If the window size is 3 and the horizon is 1, then the 4th row of the original time series becomes the label of the first example.
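The interplay of window size, step size and horizon can be mimicked in plain Python. This is a simplified sketch of the windowing idea on a toy series, not the operator's actual implementation:

```python
# Toy series standing in for the monthly WERT values (values are made up)
series = [10, 11, 12, 13, 14, 15, 16, 17]

def window(series, window_size, step_size, horizon):
    """Turn a 1-D series into (window, label) examples, mimicking the
    Windowing operator: each example holds `window_size` consecutive
    values, the label lies `horizon` steps after the window, and
    `step_size` controls how far the window slides each time."""
    examples = []
    i = 0
    while i + window_size + horizon - 1 < len(series):
        x = series[i : i + window_size]            # the window attributes
        y = series[i + window_size + horizon - 1]  # the label to predict
        examples.append((x, y))
        i += step_size
    return examples

# Window size 3, step size 1, horizon 1: the 4th value becomes the first label
print(window(series, window_size=3, step_size=1, horizon=1))
```

With step size 1 every consecutive window is kept; a larger step size thins the examples out, exactly as described above.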
Now we can predict the number of car accidents using our training and test data. Select the “Sliding Window Validation” operator from [Extensions-> Value Series-> Evaluation-> Validation-> Sliding Window Validation] to validate the model (in this case a linear regression). Look inside the operator to see how the model is trained and how the performance is calculated. Connect the “Validation” operator as in the following screenshot:
Fig.18 Test and training parts of data set and first steps of modeling
“Sliding Window Validation” encapsulates sliding windows of training and test in order to estimate the performance of a prediction operator.
In my example I set the training window width to 100 example rows and the test window width to 90; the step size defines the size of the last test window. The horizon is how far into the future I want to predict; since I just want the values for the next month, I set it to 1.
Fig.19 Validation parameters
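The splitting scheme behind sliding-window validation can be sketched as index ranges. The widths below are small illustrative numbers, not the tutorial's actual settings:

```python
# A minimal sketch of sliding-window validation: train on a fixed-width
# window, test on the window that follows, then slide both forward.
def sliding_window_validation(n_examples, train_width, test_width, step):
    """Return (train_indices, test_indices) pairs of index ranges."""
    splits = []
    start = 0
    while start + train_width + test_width <= n_examples:
        train = range(start, start + train_width)
        test = range(start + train_width, start + train_width + test_width)
        splits.append((train, test))
        start += step
    return splits

for train, test in sliding_window_validation(20, train_width=8, test_width=4, step=4):
    print(f"train {train.start}-{train.stop - 1} | test {test.start}-{test.stop - 1}")
```

Each test window always lies strictly after its training window, so the model is only ever evaluated on "future" data it has not seen.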
Double-clicking on “Sliding Window Validation” opens two sub-processes. According to the documentation of the “Neural Net” operator, a neural network is a mathematical or computational model that is inspired by the structure and functional aspects of biological neural networks. A neural network consists of an interconnected group of artificial neurons and processes information using a connectionist approach to computation. So let’s place the “Neural Net” operator in the training part and the “Apply Model” operator in the testing part. The “Select Attributes” operator is optional in this case; there you can select the attributes which you want to see in your output model. We also need a “Performance (Regression)” operator for the performance evaluation of the regression task. Regression is a measure of the strength of the relationship between one dependent variable (in our case the new number of car accidents) and a series of other changing independent variables (the number of car accidents in the last three months).
Fig.20 Test and train parts of data set and first steps of modeling
Certainly, you can try different parameters for the neural net, but first let’s try 500 training cycles. This parameter specifies the number of training cycles; in each cycle the network compares its output value with the correct answer to compute the value of the built-in error function. All other parameters can be left at their defaults.
Fig. 21 Neural Network parameters
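To make the "training cycles" idea concrete, here is a toy stand-in: 500 cycles of gradient descent on a linear model over three lag features. The Neural Net operator does the same kind of iterative error minimisation, just with hidden layers; the data here is synthetic and the learning rate is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))              # 3 lag features per example (synthetic)
true_w = np.array([0.5, 0.3, 0.2])        # assumed "true" relationship
y = X @ true_w + rng.normal(scale=0.01, size=50)

w = np.zeros(3)
for cycle in range(500):                  # the "training cycles" parameter
    pred = X @ w
    error = pred - y                      # compare output with the correct answer
    w -= 0.01 * X.T @ error / len(y)      # step against the error gradient

print(np.round(w, 2))
```

After 500 cycles the learned weights have moved close to the true ones; too few cycles would leave the error function only partially minimised.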
The blue line represents the actual number of alcohol-caused traffic accidents and the red line the predicted number. Even though there is a big difference in the forecast, the trends are often identical. As you can see, my first trend prediction does not look very impressive and is not precise. Anyway, it was a nice try – since we just took the default or intuitively chosen values, we can’t, of course, expect a perfect forecasting model on the first try. But we can make our prediction more precise; we will optimize our model in Part 3: Optimization.
Fig.22 Actual values and prediction chart
Fig.23 Actual values and prediction table
You can find the exact values by clicking on [Results-> ExampleSet (Apply Model)]. In the blue box are the actual values of alcohol-caused traffic accidents.
Fig.24 Performance and error rate
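The error figures reported by the “Performance (Regression)” operator boil down to standard formulas. Here is how root mean squared error and mean absolute error are computed on a few made-up actual/predicted pairs (not the tutorial's numbers):

```python
import math

# Illustrative actual vs. predicted monthly values (made up)
actual    = [120, 115, 130, 125]
predicted = [118, 119, 124, 128]

errors = [a - p for a, p in zip(actual, predicted)]

# Root mean squared error: penalises large misses more heavily
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))

# Mean absolute error: the average size of a miss
mae = sum(abs(e) for e in errors) / len(errors)

print(round(rmse, 2), round(mae, 2))
```

Comparing both numbers is useful: if RMSE is much larger than MAE, a few months are being predicted very badly rather than all months slightly badly.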
You can find a more detailed description of the hidden and output layers in the results [Results-> Improved NeuralNet-> Description].
Fig. 25 Layers of Neural Network
1. Forecasting result:
As a result of running my model, RapidMiner provides three dimensions of forecasting:
Graphical dimension – many default and advanced charts;
Quantitative – sampling of test data, predicted, clustered etc. values;
Performance – the result of calculated errors.
2. RapidMiner can also be useful for people who are not data analysts and have no experience with complex machine learning algorithms. This tool has a simple and intuitive interface and enough tips and documentation for a beginner.