Dive into the World of Data Mining! Part 1: Introduction to RapidMiner
Rapid Data Growth
The amount of data being created and harvested by organizations and private individuals is growing exponentially. This trend can be clearly seen in the steadily growing IoT, but other industries can also access a huge range of data sources, whether free and public or private and available for a subscription fee.
This increase in data volume brings new challenges for analysts and specialists working on the optimization of business tasks. The pace of development in the global economy is increasing, and a quick response to changes on the micro level is what allows individual companies to expand. To aid that, there are tools for data analysis and machine learning.
Nowadays many companies are in need of system analysts. Lack of experience and, in most cases, the excessive complexity of the software on the one hand, and the high costs of employee training and of maintaining expensive data processing and storage systems on the other, force them to abandon the idea of building their own analytical system in favor of a much simpler Excel-based solution.
There is a distinct lack of open source solutions for data mining and data analytics, but one of the most capable, efficient and free software solutions is RapidMiner Studio. It is a tool created for data mining, with the basic idea that the analyst does not need good programming skills. To make the data mining process more transparent and smooth, it has a good set of predefined operators that solve a wide range of problems. These operators can also obtain and process information from various sources, for example databases, local files, etc. On top of that, RapidMiner is a complete tool for ETL processes.
Besides more than 400 analytic functions, there is also RapidMiner Server, which can be used as a (cloud) repository for storing and executing RapidMiner processes (including on a schedule). The server has a web interface for managing connections to data sources and for viewing details of the processes.
First steps with RapidMiner
From 1950 to 2015, over 696,226 people lost their lives on German roads. The record year was 1970: 19,193 people died in traffic, more than half a million were injured, and sadly many of them were left permanently disabled. Since then, the number of car accidents in Germany has declined almost every year. This decrease may have several causes: technologically more advanced cars, modern safety measures in and around the vehicles, better road quality, speed limits, mandatory seat belt use, a lower alcohol limit, and others. But which of these innovations are the most effective? Is there any correlation between the year, the month and the number or type of car accidents? Is it possible to predict the number of car accidents in the next period?
Let’s analyze the data set from the Munich statistical office with monthly figures of traffic accidents and try to answer the aforementioned questions. You can find the data set here: https://www.opengov-muenchen.de/dataset/monatszahlen-verkehrsunfaelle/resource/40094bd6-f82d-4979-949b-26c8dc00b9a7
Let’s get acquainted with RapidMiner:
On the left side of the screen you can see the data and process repository panel and the operators. RapidMiner provides the ability to load data or processes from a database or cloud storage (Amazon S3, Azure Blob, Dropbox).
For convenience, operators are divided into categories:
- access to the data (job files, databases, cloud storage, Twitter streams, Salesforce);
- operators to work with attributes (transformation of types, dates, set operations, etc.);
- mathematical modeling operators (predictive models, cluster analysis models, optimization models);
- auxiliary operators (running Java and Groovy routines, data anonymization, sending e-mail messages, event planner).
Those are the main categories, each of which has its own subcategories and variations of operators. It is possible to add new operators from the ever-growing RapidMiner Marketplace. For example, among the available extensions there is an operator that converts data sets into time series.
The central part of the screen is the workspace in which you build a data conversion process. Using drag and drop, we can add, change or remove the data sources and the operators of our process. By connecting the operators and setting their parameters, we define how the process is executed. At the bottom of the middle panel are the tips: based on processes built by other users, RapidMiner gives you recommendations on which operators to use. In the right panel you can see the parameters and a description of how the selected operator works.
Traffic accidents in Munich… Try yourself as a data miner
First of all, download the data (see Figure 2), then drag and drop it into the process or use the Retrieve operator to load it.
As a result, you can see your data as a table if you click on the Results button. If necessary, you can change the data types or the names of attributes with the Import Wizard.
Then run the process and you will see the result of loading your data (in this case, the CSV file). The data represents the monthly number of accidents from 2000 to 2017. When you import the data, make sure the date format is set correctly for the date attributes. After that, connect the output port of the data block to the result port (res). Now you can press “start” and the program will show the overall statistics. The results are summarized in Figure 4.
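For readers who like to see the same steps as code, here is a rough Python sketch of what this import does conceptually: read a CSV, parse the date with an explicit format, and compute overall statistics. It is not RapidMiner's API. The column names (`month`, `category`, `accidents`) and all values are made up for illustration and do not come from the Munich data set.

```python
import csv
import io
import statistics
from datetime import datetime

# Made-up sample rows standing in for the downloaded CSV file.
# Column names and values are illustrative assumptions, not the real data.
SAMPLE_CSV = """month,category,accidents
2000-01,driving error,120
2000-02,driving error,100
2000-01,drunk driving,30
2000-02,drunk driving,26
"""


def load_rows(text, date_format="%Y-%m"):
    """Parse CSV text, converting the month column with an explicit date format."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        rows.append({
            # strptime raises ValueError if the format does not match the data,
            # which is the scripted counterpart of a wrong format in an import dialog.
            "month": datetime.strptime(record["month"], date_format),
            "category": record["category"],
            "accidents": int(record["accidents"]),
        })
    return rows


def summarize(rows):
    """Return simple overall statistics for the accident counts."""
    counts = [row["accidents"] for row in rows]
    return {
        "rows": len(counts),
        "min": min(counts),
        "max": max(counts),
        "mean": statistics.mean(counts),
    }


rows = load_rows(SAMPLE_CSV)
summary = summarize(rows)
print(summary)
```

Passing a `date_format` that does not match the file makes `load_rows` fail immediately, which is why it pays to check the format before looking at any statistics.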
Use the Charts tab to plot the distribution of the data.
The first graph shows that most of the accidents were caused by driving errors (brown); the other two groups are accidents caused by drunk driving (blue) and by escaping from the police (green) (Figure 5).
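The comparison behind such a chart is a sum per category. As a minimal sketch (plain Python, not something RapidMiner exposes), the totals could be computed like this; the category labels follow the text above, and the numbers are invented purely for illustration.

```python
from collections import Counter

# Made-up (category, accidents) pairs; the values are illustrative only.
records = [
    ("driving error", 120),
    ("drunk driving", 30),
    ("escaping from the police", 50),
    ("driving error", 100),
]


def totals_by_category(records):
    """Sum the accident counts for each category."""
    totals = Counter()
    for category, accidents in records:
        totals[category] += accidents
    return totals


totals = totals_by_category(records)
# most_common(1) returns the category with the largest total.
largest_category, largest_total = totals.most_common(1)[0]
print(largest_category, largest_total)
```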
As can be seen, you can visualize the data automatically in the statistics blocks or use other chart styles. Moreover, you can customize your chart and choose other colors, dimensions, styles, etc.
It is also possible to build complex charts with three or more dimensions, like this one:
Here the months and years are on the x-axis, the number of accidents is on the y-axis, and the kind of accident is color-coded as in Figure 5. The bubble size represents the number of people who died.
In conclusion, 4-dimensional modeling in RapidMiner is quite easy. Even if you are not a data analyst and have no experience in data mining or statistics, you can intuitively find a good graphical representation of your data.
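To make the four dimensions explicit, the mapping from a record to a bubble can be sketched in a few lines of Python. This only illustrates the idea and is not how RapidMiner builds its charts. The field names, the `size_per_death` scaling factor, and all values are assumptions made for the example; the colors follow the description of Figure 5.

```python
# Colors per accident type, as described for Figure 5.
COLORS = {
    "driving error": "brown",
    "drunk driving": "blue",
    "escaping from the police": "green",
}

# Made-up records; field names and values are illustrative assumptions.
records = [
    {"year": 2000, "month": 1, "category": "driving error", "accidents": 120, "deaths": 2},
    {"year": 2000, "month": 7, "category": "drunk driving", "accidents": 30, "deaths": 1},
]


def to_bubble_points(records, size_per_death=20.0):
    """Map each record to the four visual dimensions of a bubble chart."""
    points = []
    for record in records:
        points.append({
            # x: year plus the fraction of the year that has passed before this month
            "x": record["year"] + (record["month"] - 1) / 12,
            # y: number of accidents
            "y": record["accidents"],
            # color: type of accident
            "color": COLORS[record["category"]],
            # size: grows linearly with the number of deaths
            "size": size_per_death * record["deaths"],
        })
    return points


points = to_bubble_points(records)
for point in points:
    print(point)
```

A list like `points` could then be handed to any plotting library that supports per-point color and size, for example a scatter plot.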
P.S. In “Part 2: Building a model” we will dive deeper into the world of data mining and build our first prediction model.