In the healthcare field, machine learning (ML) applications and algorithms increasingly focus on answering precise clinical questions, such as prognostication (predicting outcomes), categorization (assigning patients to groups), detection (of outliers or abnormal findings), and dimensionality reduction (discarding or grouping variables according to relevance). Lately, data volumes and the use of ML algorithms have grown rapidly alongside advances in big data and data mining techniques.
With more information, ML techniques can be improved. But how can these algorithms be evaluated? Should the quality of the instrument itself be tested? It should, and a set of performance measurements makes this possible.(1,2,3)
A classifier is an algorithm that automatically orders or categorizes data into one or more of a set of classes.(1) Classifiers are evaluated based on performance metrics computed after the model-training process. While various classifiers are available for machine learning purposes, there is still no general consensus on which performance metrics to use. Often, one or more metrics are chosen simply because they are commonly used in the community, which can lead to conflicting conclusions.
A classifier should be evaluated with a set of metrics, each capturing a distinct aspect of its behavior. This matters because most performance metrics are derived from the same four values of the confusion matrix: the numbers of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).
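As a minimal sketch of how these four values combine into familiar metrics, the Python snippet below uses purely illustrative counts, not results from any real model:

```python
# Illustrative confusion-matrix counts (hypothetical values, not real data)
tp, tn, fp, fn = 40, 45, 5, 10

sensitivity = tp / (tp + fn)                # true positive rate (recall)
specificity = tn / (tn + fp)                # true negative rate
precision = tp / (tp + fp)                  # positive predictive value
accuracy = (tp + tn) / (tp + tn + fp + fn)  # overall fraction correct

print(f"Sensitivity: {sensitivity:.2f}  Specificity: {specificity:.2f}")
print(f"Precision:   {precision:.2f}  Accuracy:    {accuracy:.2f}")
```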
Cross-validation: Frequently used in ML. To evaluate whether a predictive algorithm works, it must be tested on data it has not seen. Instead of testing it on an entirely new population, an initial group of subjects is chosen and repeatedly split into training and testing datasets. This validation process yields a numeric representation (the number of correct and incorrect predictions) and determines how precise and accurate the predictions are, translating into sensitivity and specificity values.(1)
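The sketch below illustrates k-fold cross-validation, assuming scikit-learn is available and using a synthetic binary-outcome dataset in place of real clinical data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a clinical dataset: 500 subjects, 10 variables
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

model = LogisticRegression(max_iter=1000)

# 5-fold cross-validation: the subjects are repeatedly split into
# training and testing folds, and the score is computed on each test fold
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print("Accuracy per fold:", scores)
print(f"Mean accuracy: {scores.mean():.3f}")
```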
ROC curve: By plotting sensitivity against 1 − specificity at different decision thresholds, a curve representing the performance of a particular predictive algorithm can be drawn, allowing its utility to be grasped at a glance. A specific operating point on the curve can be chosen for different tasks. The area beneath the curve, the AUROC (area under the ROC curve), is commonly quoted and offers a quick way to compare algorithms.(1)
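A sketch of computing the ROC curve and AUROC, again assuming scikit-learn and synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]  # probability of the positive class

# Each point on the curve corresponds to one decision threshold:
# fpr = 1 - specificity, tpr = sensitivity
fpr, tpr, thresholds = roc_curve(y_test, probs)
print(f"AUROC: {roc_auc_score(y_test, probs):.3f}")
```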
Confusion matrix: An N×N table layout, where N is the number of target classes, that allows visualization of an algorithm's performance, typically for supervised learning (in unsupervised learning, it is usually called a matching matrix). It compares the actual target values with those predicted by the machine learning model, giving an overview of how well the model performs and what errors it makes.(1)
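For a binary problem (N = 2), the matrix reduces to the four counts introduced above. A minimal sketch using scikit-learn's confusion_matrix (an assumed library choice) with made-up labels:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical actual vs. predicted labels for eight subjects
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are actual classes, columns are predicted classes (2x2 here)
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
print(cm)
print(f"TP={tp}  TN={tn}  FP={fp}  FN={fn}")
```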
Mean squared error (MSE): Measures how well the regression line fits the data and how reliably the model makes predictions. It is calculated by averaging the squared deviations of the known values from the regression line. The error is the mean deviation from the true values; lower is better. Multiple variants exist: the root mean squared error (RMSE), the mean absolute error (MAE), and the mean absolute percentage error (MAPE). All differ slightly in what they represent.(1)
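The following sketch computes MSE and its variants directly from their definitions with NumPy, using hypothetical observed values and predictions:

```python
import numpy as np

# Hypothetical observed values and regression predictions
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.8, 5.4, 2.9, 6.5])

mse = np.mean((y_true - y_pred) ** 2)                     # mean squared error
rmse = np.sqrt(mse)                                       # root MSE, in the units of y
mae = np.mean(np.abs(y_true - y_pred))                    # mean absolute error
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100  # mean absolute % error

print(f"MSE={mse:.3f}  RMSE={rmse:.3f}  MAE={mae:.3f}  MAPE={mape:.1f}%")
```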
As data sources and data-processing methods continue to develop rapidly, there is still no consensus on which performance metrics should be used with a specific classifier. Some studies investigate a handful of such metrics and compare them in pursuit of the most appropriate one, but much research is still needed.(2)