It is easy to see machine learning as a magical black box into which you feed data and out of which predictions appear. That said, there is nothing magical about machine learning, writes IDG News. In fact, understanding how the different parts of machine learning work is important for getting better results. So join us on a tour.
As in many other IT contexts, such as devops, the term “pipeline” is used in machine learning. It is a visual metaphor for how data flows through a solution. The pipeline can be roughly divided into four parts:
- Collect data, which in English goes by the slightly odd name “ingesting”.
- Prepare data, for example by cleaning it and normalizing it if needed. Normalization in this context should not be confused with the normalization of relational databases; here it means bringing different value scales into line with each other.
- Model training.
- Provide predictions.
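The four parts can be sketched as four small functions that hand data to each other. This is only an illustration: the function names, the made-up data and the trivial threshold "model" are all assumptions chosen to keep the example short, not a recommended design.

```python
def ingest():
    """Collect raw data; here a hard-coded, made-up list of (value, label) pairs."""
    return [(1.0, 0), (2.0, 0), (None, 1), (8.0, 1), (9.0, 1)]


def prepare(rows):
    """Clean the data (drop missing values) and scale the values to the range 0..1."""
    clean = [(x, y) for x, y in rows if x is not None]
    lo = min(x for x, _ in clean)
    hi = max(x for x, _ in clean)
    scaled = [((x - lo) / (hi - lo), y) for x, y in clean]
    return scaled, (lo, hi)


def train(rows):
    """Fit a deliberately trivial model: a threshold midway between the class means."""
    zeros = [x for x, y in rows if y == 0]
    ones = [x for x, y in rows if y == 1]
    return (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2


def predict(threshold, scale, raw_value):
    """Apply the same scaling as in prepare(), then compare with the threshold."""
    lo, hi = scale
    return 1 if (raw_value - lo) / (hi - lo) > threshold else 0


rows, scale = prepare(ingest())
model = train(rows)
print(predict(model, scale, 7.0), predict(model, scale, 3.0))
```

Note that the scaling learned during preparation has to be reused at prediction time, which is one reason the phases are worth keeping apart.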
Here are more detailed descriptions of the four phases:
Decide on data
Two things are needed to get started with machine learning: data to train a model on and algorithms that control the training. Data can come from different sources. Often it is data from a business process that is already being collected, either continuously or in archived form.
In some cases, you have to work with streaming data. Then you can choose between processing the stream directly or first storing it in a database. If you process the stream directly, there is a further choice between two options: either you use new data to fine-tune an existing model, or you build new models from time to time and train them on new data.
These decisions affect the choice of algorithms. Some algorithms are suitable for fine-tuning models, others are not. In the latter case, you have to start over and train a new model on new data.
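The difference between the two options can be shown with a deliberately tiny model, one that simply predicts the mean of what it has seen. The class `RunningMean` and the function `retrain` are hypothetical names for this sketch, and the stream values are made up.

```python
class RunningMean:
    """A model that can be fine-tuned: it predicts the mean of everything seen so far."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        # Incremental update: adjust the existing estimate; old data is not needed.
        self.count += 1
        self.mean += (value - self.mean) / self.count


def retrain(values):
    """The alternative: build a fresh model from a batch of stored data."""
    return sum(values) / len(values)


stream = [10.0, 12.0, 11.0, 13.0]  # made-up streaming values

# Option 1: fine-tune one existing model as each value arrives.
model = RunningMean()
for value in stream:
    model.update(value)

# Option 2: from time to time, train a new model on the most recent data only.
fresh_model = retrain(stream[-2:])

print(model.mean, fresh_model)
```

An algorithm without an `update`-style step leaves only the second option.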
Data cleaning is often about scales
Data that is pulled from many different sources can be quite messy. One thing that often needs sorting out is normalizing the data, i.e. converting different data values to the same scale.
A simple example is that 2.45 meters in the high jump can be considered worth as much as 8.95 meters in the long jump, as both are world records. For a model to treat the values as equally valuable, they need to be converted, or normalized, for example to 1.0 in both cases.
But in some cases normalization is not appropriate, namely when the scale itself matters. If you want to compare female and male high jumpers, it may be appropriate to normalize so that 2.45 meters for men has the same value as 2.09 meters for women, as both are world records. But if you want to compare high jumpers regardless of gender, you should not normalize the values.
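As a minimal sketch of this kind of normalization, each result can be divided by the reference value for its event. The function name `normalize` and the two athlete results are made up for illustration; the record figures are the ones quoted in the text.

```python
def normalize(value, reference):
    """Express a value as a fraction of a reference value on the same scale."""
    return value / reference


# Reference values as quoted in the text above.
WORLD_RECORDS = {"high_jump": 2.45, "long_jump": 8.95}

# Hypothetical results in meters, on two very different scales.
results = [("high_jump", 2.30), ("long_jump", 8.40)]

# After normalization both results sit on a common 0..1 scale.
scores = {event: normalize(mark, WORLD_RECORDS[event]) for event, mark in results}

for event, score in scores.items():
    print(f"{event}: {score:.3f}")
```

If the raw scale is what matters, as when ranking high jumpers against each other regardless of group, the raw values would be compared directly and this step skipped.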
During the data preparation phase, it is also important to analyze how bias can affect models. Bias can enter, for example, through how data is selected or how it is normalized.
Time for hard training
The next phase is the actual training of a model. It involves using data to generate a model from which predictions can be made. The key activity during training is choosing settings, which in English is called hyperparameter tuning.
A hyperparameter is a setting that controls how a model is created from an algorithm. A very simple example is dividing a number of words into categories. In that case, one hyperparameter can be the number of categories you want. One way to arrive at good hyperparameters is simply to try them out. But in some cases, these settings can be optimized automatically.
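Trying out a hyperparameter can look like the sketch below, where the number of categories `k` is varied for a small one-dimensional k-means clustering. The algorithm choice, the function names and the data are assumptions made for the example; numbers stand in for the items being categorized to keep the code short.

```python
def kmeans_1d(values, k, iterations=20):
    """Cluster 1-D values into k groups; k is the hyperparameter."""
    ordered = sorted(values)
    # Deterministic start: pick k evenly spaced points from the sorted data.
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Move each center to the mean of its group (keep it if the group is empty).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups


def inertia(centers, groups):
    """Sum of squared distances from each value to its group's center."""
    return sum((v - c) ** 2 for c, g in zip(centers, groups) for v in g)


values = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 9.0, 9.2, 8.8]  # made-up data

# "Simply try them out": run the same algorithm with several values of k.
results = {k: inertia(*kmeans_1d(values, k)) for k in (1, 2, 3)}
for k, score in results.items():
    print(k, round(score, 2))
```

The score always shrinks as `k` grows, so in practice one looks for the point where adding categories stops helping much rather than for the smallest number.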
Sometimes training can be run in parallel on several processors, which of course improves performance. Strictly speaking they do not have to be separate processors, so one speaks of “workers” instead. Workers in this case are simply different copies of a program that run at the same time in different places.
Parallelization can mainly be done in two ways: either different workers handle different parts of a data set (data parallelism), or different workers handle different parts of the model (model parallelism).
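The first variant, where workers share out the data set, can be sketched with threads from the Python standard library. The model here is an assumed one-parameter fit of y ≈ w·x, and the function names, data and learning rate are illustrative; real training frameworks distribute the work in more elaborate ways.

```python
from concurrent.futures import ThreadPoolExecutor


def shard_gradient(shard, w):
    """Gradient of the squared error for y ≈ w * x, summed over one shard."""
    return sum(2 * (w * x - y) * x for x, y in shard)


def data_parallel_step(data, w, n_workers=2, lr=0.05):
    """One training step in which each worker handles its own part of the data."""
    shards = [data[i::n_workers] for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        grads = list(pool.map(lambda shard: shard_gradient(shard, w), shards))
    # Combine the partial results into one update of the shared model.
    return w - lr * sum(grads) / len(data)


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # made-up data, y = 2x

w = 0.0
for _ in range(50):
    w = data_parallel_step(data, w)
print(w)
```

Every worker sees the whole model (the single number `w`) but only a slice of the data; in model parallelism it would be the other way around.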
Time for delivery
The final phase is to put the trained model to use, which can be called the “predict and deliver” phase. Now you run the model on new data to generate a prediction. In face recognition, for example, the incoming data is a digital image of a face. Based on its training on other images of faces, the model can now make new predictions.
How you handle all the activities in the different phases, or the different parts of the pipeline, varies. Using cloud services increases the chance of handling several parts in the same place, such as training data, pre-trained models, and so on.
In some cases, you must decide whether the different parts should be handled on servers or on client devices. One advantage of running the processing on a client, such as a smartphone, is increased availability for the user. One potential disadvantage is lower prediction quality, because fewer hardware resources are available; another is lower performance, meaning it takes longer to generate a prediction.
Iterative working method
Illustrating the whole flow of machine learning as a pipeline, i.e. a pipe, is a bit misleading. The work is often iterative, that is, certain phases are repeated and refined. The typical example is a model being fine-tuned with new data.
The advantage of thinking of a pipeline with clearly separated parts is that it becomes easy to focus on each part as a distinct area that works in its own way.
A general observation is that machine learning could just as well be called data analysis, or even math, as AI. The reason machine learning is called AI may be that it is a technology that makes it possible to draw conclusions that humans, at least in most cases, cannot.