K-Nearest Neighbor (KNN)
Theory:
K-Nearest Neighbor (KNN) is a simple and widely used machine learning algorithm for both classification and regression problems. It is based on the idea of finding the k-nearest neighbors of a given data point and making predictions based on their values.
For a classification problem, KNN assigns a new data point to the class that is most common among its k-nearest neighbors. For a regression problem, KNN predicts the value of a new data point based on the mean or median of the values of its k-nearest neighbors.
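As a toy illustration of that difference (the neighbor labels and target values below are made up for the example), the two prediction rules can be written in a few lines of Python:

from collections import Counter
from statistics import mean

neighbor_labels = ["A", "A", "B"]   # class labels of the 3 nearest neighbors (made-up)
neighbor_values = [2.0, 3.0, 4.0]   # target values of the 3 nearest neighbors (made-up)

print(Counter(neighbor_labels).most_common(1)[0][0])  # classification: "A" by majority vote
print(mean(neighbor_values))                          # regression: 3.0, the mean of the values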
The algorithm works as follows:
- Choose the number of nearest neighbors (k).
- Calculate the distance between the new data point and all the training data points.
- Find the k-nearest neighbors of the new data point based on the calculated distances.
- For classification, assign the new data point to the class that is most common among its k-nearest neighbors.
- For regression, predict the value of the new data point as the mean or median of the values of its k-nearest neighbors (a minimal regression sketch follows this list).
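The classification variant is implemented in full later in this section. For the regression case, a minimal sketch (the function name knn_regress and the toy data are illustrative, not from the original) might look like this:

from math import sqrt

def knn_regress(X_train, y_train, test_point, k):
    # Distance from the test point to every training point.
    distances = []
    for index, point in enumerate(X_train):
        d = sqrt(sum((a - b) ** 2 for a, b in zip(point, test_point)))
        distances.append((d, index))
    # Sort by distance and average the target values of the k nearest neighbors.
    distances.sort(key=lambda pair: pair[0])
    return sum(y_train[i] for _, i in distances[:k]) / k

X_train = [[1.0], [2.0], [3.0], [4.0]]
y_train = [1.5, 2.5, 3.5, 4.5]
print(knn_regress(X_train, y_train, [2.2], k=2))  # 3.0, the mean of the 2 closest targets

Using the median instead of the mean is a one-line change and is more robust to outlying neighbor values.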
KNN is considered a lazy learning algorithm because it builds no model during training; it simply stores the training data and defers computation to prediction time. It is also non-parametric: it makes no assumptions about the underlying data distribution. Predictions are based solely on the distances between the new data point and the stored training points.
One of the advantages of KNN is its simplicity and ease of implementation. However, one of its disadvantages is that it can be computationally expensive and slow for large datasets, as the algorithm must calculate the distances between the new data point and all the training data points.
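Library implementations typically mitigate this cost by replacing the brute-force scan with a spatial index such as a KD-tree or ball tree. A minimal sketch using scikit-learn (assuming scikit-learn is available; the synthetic data and parameters are made up for the example):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data: 10,000 points in 2 dimensions with random binary labels.
rng = np.random.default_rng(0)
X_train = rng.random((10_000, 2))
y_train = rng.integers(0, 2, size=10_000)

# algorithm="kd_tree" builds a spatial index instead of scanning every point at query time.
clf = KNeighborsClassifier(n_neighbors=5, algorithm="kd_tree")
clf.fit(X_train, y_train)
print(clf.predict([[0.5, 0.5]]))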
Here are some real-world examples of using K-Nearest Neighbor (KNN):
- Recommendation systems: KNN can suggest items to users based on items they have liked in the past. The algorithm finds the k-nearest neighbors of a user based on their past preferences and recommends items that are popular among those neighbors.
- Image classification: KNN can identify objects in an image based on their appearance and shape. The algorithm finds the k-nearest neighbors of an image patch in a training dataset of labeled images and assigns the majority label among those neighbors to the patch.
- Handwriting recognition: KNN can identify the digit written in an image of handwriting. The algorithm finds the k-nearest neighbors of the image in a training dataset of labeled digit images and assigns the majority label among those neighbors to the image (see the sketch after this list).
- Credit scoring: KNN can predict the creditworthiness of a borrower from the credit history of similar borrowers. The algorithm finds the k-nearest neighbors of the borrower in a training dataset of labeled borrowers and assigns the majority label among those neighbors to the borrower.
These are just a few examples of how K-Nearest Neighbor (KNN) can be applied in real-world problems. There are many other areas where KNN can be useful, including biology, geology, and engineering, among others.
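To make the handwriting-recognition example above concrete, here is a minimal sketch using scikit-learn's bundled digits dataset (assuming scikit-learn is installed; the choice of k=3 and the 75/25 split are arbitrary):

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# 8x8 grayscale images of handwritten digits, flattened to 64 features each.
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0
)

clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, clf.predict(X_test)))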
Here's an implementation of the K-Nearest Neighbor (KNN) algorithm in Python:
import numpy as np
from math import sqrt
from collections import Counter

def euclidean_distance(point1, point2):
    """Calculate the Euclidean distance between two points."""
    distance = 0
    for i in range(len(point1)):
        distance += (point1[i] - point2[i]) ** 2
    return sqrt(distance)

def knn(X_train, y_train, test_point, k):
    """Implementation of the K-Nearest Neighbor algorithm (classification)."""
    # Compute the distance from the test point to every training point.
    distances = []
    for index, point in enumerate(X_train):
        distance = euclidean_distance(point, test_point)
        distances.append((distance, index))
    # Sort by distance and keep the k closest training points.
    distances = sorted(distances, key=lambda x: x[0])[:k]
    indices = [index for _, index in distances]
    closest_neighbors = [y_train[index] for index in indices]
    # Return the most common label among the k nearest neighbors (majority vote).
    return Counter(closest_neighbors).most_common(1)[0][0]

# Example usage
X_train = np.array([[1, 2], [3, 4], [5, 6]])
y_train = np.array([1, 2, 3])
test_point = np.array([2, 3])
k = 2
prediction = knn(X_train, y_train, test_point, k)
print("Prediction:", prediction)
In this example, the knn function takes four parameters: X_train is the training data, y_train contains the labels for the training data, test_point is the point for which we want to make a prediction, and k is the number of nearest neighbors to consider. The function returns the most common label among the k-nearest neighbors of the test point.
The euclidean_distance function calculates the Euclidean distance between two points. The knn function first calculates the distances between the test point and all the training data points, then finds the indices of the k-nearest neighbors and returns the most common label among those neighbors.
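The per-point Python loop above is easy to follow but slow for large training sets. Since X_train is already a NumPy array, the same computation can be vectorized with broadcasting; a minimal sketch (the name knn_vectorized is illustrative, not from the original):

import numpy as np
from collections import Counter

def knn_vectorized(X_train, y_train, test_point, k):
    # Euclidean distances from the test point to every training point, in one shot.
    dists = np.sqrt(((X_train - test_point) ** 2).sum(axis=1))
    # Indices of the k smallest distances, then a majority vote over their labels.
    nearest = np.argsort(dists)[:k]
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[1, 2], [3, 4], [5, 6]])
y_train = np.array([1, 2, 3])
print(knn_vectorized(X_train, y_train, np.array([2, 3]), k=2))

This produces the same prediction as the loop-based version while letting NumPy do the distance arithmetic in compiled code.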