2017
November 26, 2023
Luxembourg
November 28, 2023

Training Process

Machine Learning (ML) aims to simplify everyday personal and industrial tasks. Its goal is to build a model, based on one or more algorithms, that can be applied systematically to a chosen class of problems. The process can be broken down into five essential steps: evaluating the data, choosing a model, training the model, evaluating the model, and fine-tuning the model. (1)

Evaluating the Data

Data can be classified as structured, unstructured, semi-structured, or time-series, and a real dataset can also be a mixture of these types. Structured data (e.g., financial information, addresses, product information) tends to be the most straightforward to work with but usually accounts for only about 20% of the information available. Most of the available data is unstructured (e.g., images, videos, audio, text) and needs to be processed before it can be used.

Semi-structured data carries internal tags that help with categorization (e.g., data in JavaScript Object Notation or Extensible Markup Language format), but it accounts for only 5-10% of all data. The most difficult type to work with is time-series data, which records interactions over time (e.g., a customer journey through a website, app, or store). Once the data has been identified and compiled, it is better to present it to machine learning algorithms in randomized order so that they detect patterns by themselves rather than picking up on the order in which the data was collected. (1)
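The randomization step can be illustrated with a minimal Python sketch. The function name `shuffle_records`, the fixed seed, and the toy records are assumptions made for illustration only; they do not come from the cited sources.

```python
import random

def shuffle_records(records, seed=0):
    """Return a shuffled copy of `records`; the input list is left unchanged."""
    rng = random.Random(seed)   # fixed seed (an assumption) keeps the shuffle reproducible
    shuffled = list(records)    # copy so the original collection order is preserved
    rng.shuffle(shuffled)       # in-place shuffle of the copy
    return shuffled

# Hypothetical records, grouped by label as they might be right after collection.
records = [(f"sample_{i}", "cat") for i in range(5)] + \
          [(f"sample_{i}", "dog") for i in range(5, 10)]

shuffled = shuffle_records(records)
for name, label in shuffled:
    print(name, label)
```

Shuffling a copy rather than the original list keeps the collected order available, which matters for time-series data where the order itself carries information.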

Choosing a Model

To choose a model, a suitable algorithm needs to be selected. The algorithm is based on a mathematical function, often a non-linear one, that produces an output from the patterns it has learned in the initial (training) data. (2) Some settings, called hyperparameters, are fixed by the programmer before training and do not change during it; the model architecture is one example. Trainable parameters, on the other hand, are values that are typically assigned at random at the start and then optimized during training. (1,2) Hundreds of ML algorithms exist, but they fall into four major categories: supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning. Each category is used with specific types of data to accomplish different objectives. (1)
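The difference between hyperparameters and trainable parameters can be sketched in a few lines of Python. Everything here is hypothetical: the class `Hyperparameters`, the function `init_trainable_parameters`, the layer sizes, and the initialization range are illustrative assumptions, not part of any specific library or of the cited sources.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)          # frozen: hyperparameters do not change during training
class Hyperparameters:
    n_inputs: int = 3            # architecture choice (assumed value)
    n_hidden: int = 4            # architecture choice (assumed value)
    learning_rate: float = 0.01  # optimization setting (assumed value)

def init_trainable_parameters(hp, seed=0):
    """Randomly initialize the values that training would later optimize."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(hp.n_inputs)]
               for _ in range(hp.n_hidden)]
    biases = [0.0] * hp.n_hidden  # biases often start at zero; also an assumption
    return {"weights": weights, "biases": biases}

hp = Hyperparameters()
params = init_trainable_parameters(hp)
print(len(params["weights"]), "x", len(params["weights"][0]), "weight matrix")
```

The hyperparameters determine the shape of the weight matrix, while the numbers inside it are the part that training adjusts.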

Training the Model

During the training process, the training data (about 70% of the complete dataset) is used to establish relationships in the algorithm. Based on this data, the training algorithm optimizes the trainable parameters. It includes a “loss function” that measures the magnitude of the error made by the model and therefore indicates how accurate the model is. Better models have smaller loss values, and training achieves this by searching for the best combination of trainable parameters. (2)
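As a rough illustration of this step, the sketch below splits a dataset 70/30 and fits a one-variable linear model by gradient descent on a mean-squared-error loss. The choice of model, loss, optimizer, learning rate, epoch count, and the synthetic data (y = 2x + 1) are all assumptions made for the example; the sources do not prescribe any of them.

```python
import random

def split_dataset(data, train_fraction=0.7, seed=0):
    """Shuffle a copy of `data` and split it into training and test parts."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def mse_loss(w, b, data):
    """Loss function: mean squared error of the model y = w * x + b."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

def train(data, learning_rate=0.1, epochs=2000, seed=0):
    """Optimize the trainable parameters w and b by gradient descent."""
    rng = random.Random(seed)
    w, b = rng.uniform(-1, 1), rng.uniform(-1, 1)   # random initial values
    n = len(data)
    for _ in range(epochs):
        # Gradients of the MSE loss with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Synthetic, noise-free data following y = 2x + 1 (an assumption for the demo).
data = [(i / 10, 2 * (i / 10) + 1) for i in range(20)]
train_data, test_data = split_dataset(data)
w, b = train(train_data)
print("learned w, b:", w, b)
print("training loss:", mse_loss(w, b, train_data))
```

Each pass lowers the loss by moving `w` and `b` against the gradient, which is one common way to search for a better combination of trainable parameters.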

Evaluating and Fine-Tuning

The test data (the remaining 30% of the dataset) is then put together; it should represent the same range and type of information as the training data. Running the model on it helps determine whether the algorithm is accurate and whether the results are consistent with expectations. Finally, the parameters, typically the hyperparameters, can be adjusted to see whether better results are obtained. (1)
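The evaluation and adjustment steps might look like the following sketch, which measures accuracy on the held-out 30% for several values of one hyperparameter. The k-nearest-neighbours classifier, the candidate values of `k`, and the synthetic noisy data are assumptions chosen to keep the example short; in practice a separate validation split is often used for this kind of comparison so that the test data stays untouched.

```python
import random
from collections import Counter

def knn_predict(train, point, k):
    """Predict the majority label among the k training points nearest to `point`."""
    nearest = sorted(train, key=lambda item: abs(item[0] - point))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy(train, test, k):
    """Fraction of held-out examples whose label is predicted correctly."""
    correct = sum(knn_predict(train, x, k) == label for x, label in test)
    return correct / len(test)

rng = random.Random(0)

# Synthetic 1-D data: label is 1 when x > 0.5, with roughly 10% of labels flipped.
data = []
for _ in range(200):
    x = rng.random()
    label = int(x > 0.5)
    if rng.random() < 0.1:
        label = 1 - label
    data.append((x, label))

cut = round(0.7 * len(data))            # 70% training, 30% held out
train, test = data[:cut], data[cut:]

# Try a few hyperparameter values (odd k avoids ties between two labels).
results = {k: accuracy(train, test, k) for k in (1, 5, 15)}
best_k = max(results, key=results.get)
print(results, "best k:", best_k)
```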
