
k-Nearest-Neighbor

K-nearest neighbor (KNN) is an algorithm used for data classification. It works based on the closest, or “neighboring,” training examples, where K is the number of neighbors taken into account.

The most common label among the neighbors of a new input determines its category. The KNN algorithm rests on the assumption that similar things exist in close proximity.

Its simplicity and effectiveness come from a supervised machine learning design: users provide labeled, categorized training datasets so that the algorithm can classify unlabeled data.

How Does It Work?

KNN evaluates a new input and classifies it by the most common label among its neighbors in the dataset. When an unlabeled input arrives, the algorithm performs two operations:

  • First, it determines how many of the closest training examples (neighbors) to include in the procedure (the K number). This is done by running the algorithm several times with different K values and evaluating the outcomes; the K with the fewest errors is chosen for the next step. 
  • Second, it determines the new data point’s classification from the selected neighbors’ labels. 
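The K-selection step described above can be sketched in Python. The dataset, labels, and candidate K values below are illustrative, not taken from any real source; the error measure here is leave-one-out error, one common way to compare K values:

```python
from math import dist  # Euclidean distance (Python 3.8+)
from collections import Counter

# Toy labeled dataset of (point, label) pairs -- made-up values.
data = [((1.0, 1.0), 'A'), ((1.2, 0.9), 'A'), ((0.8, 1.1), 'A'),
        ((1.1, 1.2), 'A'), ((3.0, 3.0), 'B'), ((3.2, 2.8), 'B'),
        ((2.9, 3.1), 'B'), ((3.1, 3.2), 'B')]

def knn_predict(train, point, k):
    """Label `point` by majority vote among its k nearest training examples."""
    neighbors = sorted(train, key=lambda ex: dist(ex[0], point))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

def loo_error(train, k):
    """Leave-one-out error: predict each point from all the others."""
    misses = sum(knn_predict(train[:i] + train[i + 1:], p, k) != y
                 for i, (p, y) in enumerate(train))
    return misses / len(train)

# Run the algorithm with several K values; keep the K with the fewest errors.
best_k = min([1, 3, 5, 7], key=lambda k: loo_error(data, k))
```

On a well-separated toy set like this, several K values tie at zero error and `min` keeps the first; on real data the error typically differs across K values.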

 

After the number of neighbors (K) is determined, KNN uses the Euclidean distance to calculate the separation between the test sample and each training sample. This way, only the K nearest neighbors are included in the vote. The smaller the K value, the less stable the results, as fewer neighbors are considered for the classification. Conversely, the more neighbors we include, the more stable the conclusion becomes. However, making K too large can also introduce errors, since distant, less similar samples begin to influence the outcome. 
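A minimal sketch of the distance step, assuming 2-D feature vectors with made-up coordinates:

```python
from math import sqrt

def euclidean(a, b):
    """Straight-line (Euclidean) distance between two feature vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical training samples and one test sample (illustrative values).
train = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
test = (0.0, 1.0)

# Keep only the K nearest training samples; only these take part in the vote.
k = 2
nearest = sorted(train, key=lambda p: euclidean(p, test))[:k]
```

Sorting the whole training set by distance, as here, is the simplest approach; it is also why KNN slows down as the dataset grows.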

In figure 2, we can see how the K number can influence the outcome. If the algorithm chooses the three closest training points (small dotted circle), the red rhombus will be classified with the green squares. However, with a K number of 9, the unlabeled data will be classified with the blue circles.
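The effect described for figure 2 can be reproduced with toy coordinates (these are illustrative, not the figure’s actual data): with K = 3 the two nearest “square” points win the vote, while widening to K = 9 brings in enough “circle” points to flip the result.

```python
from math import dist
from collections import Counter

# Unlabeled query at the origin; two 'square' points very close by,
# surrounded by a wider ring of 'circle' points (made-up layout).
train = [((0.3, 0.0), 'square'), ((0.0, 0.3), 'square'),
         ((0.4, 0.0), 'circle'), ((0.6, 0.0), 'circle'),
         ((0.0, 0.6), 'circle'), ((-0.6, 0.0), 'circle'),
         ((0.0, -0.6), 'circle'), ((0.7, 0.1), 'circle'),
         ((0.1, 0.7), 'circle')]
query = (0.0, 0.0)

def vote(k):
    """Majority label among the k training points nearest to the query."""
    neighbors = sorted(train, key=lambda ex: dist(ex[0], query))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

print(vote(3))  # the 3 nearest are 2 squares and 1 circle -> 'square'
print(vote(9))  # all 9 points vote: 7 circles outnumber 2 squares -> 'circle'
```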

KNN in Comparison to Other Algorithms

The steps KNN follows in the classification process are less complex than those of many other algorithms, so KNN is often chosen for its simplicity and comprehensibility. The design can be applied to simple classification tasks as well as regression. 

The main problem with KNN is that it must evaluate all the data to classify each input and to determine the K number, and this gets slower as the sample grows. Another factor to consider is the supervised learning technique, which prevents the algorithm from learning as new unlabeled data is incorporated; KNN can therefore have poor run-time performance if the training set is large.

Conclusions

Automatic data classification has always represented a challenge for many scientists, healthcare providers, and researchers. The K-nearest-neighbors algorithm is a simple and easy-to-understand method based on supervised machine learning. It uses a chosen number of neighbors (the K number), selected by the distance between the test sample and the training samples.

It categorizes unlabeled input according to the most common label among the training data surrounding it. The main advantage of this technique is its simplicity; however, it becomes slower as large numbers of data points are introduced.
