Movie Recommendation System

Collaborative Filtering

K Nearest Neighbors

In machine learning, KNN (K-Nearest Neighbors) plays an important role in classification and regression tasks. The major challenge when using KNN is choosing the right (best) value of k, the number of neighboring instances considered when classifying a new instance. In technical terms, k is a hyperparameter of the KNN algorithm: the user must choose its value, since it cannot be learned from the input data.
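As an illustration, here is a minimal K-NN sketch using scikit-learn; the feature matrix and labels are invented toy data for demonstration only.

```python
# Minimal K-Nearest Neighbors sketch using scikit-learn.
# X and y are illustrative placeholders, not real project data.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]])  # toy features
y = np.array([0, 0, 1, 1])                                      # toy labels

# k (n_neighbors) is the hyperparameter discussed above.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)
print(knn.predict([[1.2, 1.9]]))  # classify a new instance
```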

K-NN visualization

Advantages:

  • Simple to implement.
  • Can be effective with a large training dataset.
  • Robust to noise and outliers if a large enough k is used.

Disadvantages:

  • Determining the value of k can be complex
  • High computation cost due to calculating the distance between data points for all training samples.
  • Slow algorithm, especially with large datasets.
  • Sensitive to the local structure of the data, which may lead to overfitting.

K Means Clustering

K-means clustering is one of the simplest and most popular unsupervised machine learning algorithms. The objective of K-means is simple: group similar data points together and discover underlying patterns. To achieve this objective, K-means looks for a fixed number (k) of clusters in a dataset. The algorithm starts with a first group of randomly selected centroids, which are used as the starting points for every cluster, and then performs iterative (repetitive) calculations to optimize the positions of the centroids.
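A minimal sketch of this iterative procedure, using scikit-learn's KMeans on invented toy points:

```python
# Minimal K-means sketch using scikit-learn; the toy points are placeholders.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])

# k (n_clusters) is fixed up front; centroids are refined iteratively.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster assignment per point
print(kmeans.cluster_centers_)  # optimized centroid positions
```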

K-means clustering visualization

Advantages:

  • Simple to implement.
  • Scales to huge datasets and remains fast as the data grows.
  • Easily adapts to new examples.

Disadvantages:

  • Choosing k manually is a tough job.
  • As the number of dimensions increases its scalability decreases.
  • It is sensitive to outliers.

Logistic Regression Model

Logistic regression is a statistical method for analyzing a dataset in which there are one or more independent variables that determine an outcome. The outcome is measured with a dichotomous variable (one with only two possible outcomes). The goal of logistic regression is to find the best-fitting model to describe the relationship between the dichotomous characteristic of interest (dependent variable) and a set of independent (predictor or explanatory) variables.
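A minimal sketch using scikit-learn's LogisticRegression on invented toy data with one predictor and a dichotomous outcome:

```python
# Minimal logistic regression sketch using scikit-learn.
# X and y are invented toy data for a binary (dichotomous) outcome.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])  # one predictor
y = np.array([0, 0, 0, 1, 1, 1])                          # binary outcome

model = LogisticRegression().fit(X, y)
print(model.predict([[2.0]]))        # predicted class
print(model.predict_proba([[2.0]]))  # class probabilities
```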

Logistic Regression Model Visualization

Advantages:

  • Simple to understand and interpret.
  • Efficient for binary classification problems.
  • Can handle both numerical and categorical data.

Disadvantages:

  • Assumes a linear relationship between independent variables and the log-odds of the dependent variable.
  • May perform poorly with outliers or irrelevant features.
  • Not suitable for complex relationships between variables.

Singular Value Decomposition (SVD)

Singular Value Decomposition is a method used to decompose a matrix into three other matrices such that when these matrices are multiplied, they approximate the original matrix. It is widely used in dimensionality reduction, data compression, and solving linear equations. SVD is especially useful in data analysis and machine learning for tasks like collaborative filtering, image compression, and feature extraction.
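A minimal sketch using NumPy, decomposing an invented matrix and reconstructing a low-rank approximation:

```python
# Minimal SVD sketch using NumPy: decompose, truncate, and reconstruct.
import numpy as np

A = np.array([[3.0, 1.0, 1.0],
              [-1.0, 3.0, 1.0]])

# A = U @ diag(s) @ Vt, the three matrices mentioned above.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the largest singular value (rank-1 approximation).
k = 1
A_approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(A_approx)  # low-rank approximation of A
```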

Advantages:

  • Provides a compact representation of the original matrix.
  • Useful for dimensionality reduction without significant loss of information.
  • Helps in identifying and removing noise from data.

Disadvantages:

  • Computational complexity increases with the size of the matrix.
  • Interpretability of the decomposed matrices may not always be straightforward.
  • May not work well with matrices that are sparse or have irregular patterns.

Naive Bayes

The Naïve Bayes classifier is a supervised machine learning algorithm used for classification tasks such as text classification. It uses principles of probability to perform classification and is known as a probabilistic classifier because it is based on Bayes' Theorem: the model predicts the probability that an instance belongs to a class given a set of feature values. It is called naïve because it assumes that each feature in the model is independent of the existence of every other feature. In other words, each feature contributes to the prediction with no relation to the others. In the real world, this condition is rarely satisfied.

Conditional Probability Formula:


\[ P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)} \]

where:

  • \( P(A|B) \) is the conditional probability of event \( A \) given \( B \)
  • \( P(B|A) \) is the conditional probability of event \( B \) given \( A \)
  • \( P(A) \) is the prior probability of event \( A \)
  • \( P(B) \) is the prior probability of event \( B \)
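As a minimal sketch of the classifier in practice, the example below uses scikit-learn's GaussianNB on invented toy data; for text classification, MultinomialNB would be the more usual variant.

```python
# Minimal Naive Bayes sketch using scikit-learn's GaussianNB.
# The toy data is invented for demonstration.
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[1.0, 2.0], [1.2, 1.9], [5.0, 8.0], [5.2, 8.1]])
y = np.array([0, 0, 1, 1])

nb = GaussianNB().fit(X, y)
print(nb.predict([[1.1, 2.1]]))        # most probable class
print(nb.predict_proba([[1.1, 2.1]]))  # P(class | features) per class
```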

Advantages:

  • Handles categorical input well: Naive Bayes is particularly effective when dealing with categorical input variables, which are common in many real-world applications.
  • Works well with imbalanced datasets: Naive Bayes can handle imbalanced datasets effectively, as it does not rely on the presence of a large number of instances of each class.
  • Simple and efficient: Naive Bayes is a straightforward algorithm that requires minimal computational resources and can be easily understood and implemented.

Disadvantages:

  • The algorithm may not perform well with continuous data, as it requires the data to be discretized, which can lead to information loss.
  • Naive Bayes is not well suited to problems with complex decision boundaries, as it assumes a simple relationship between the features and the target variable.
  • The algorithm does not account for interactions between features, which can be important in some applications.

Support Vector Machine (SVM)

Support Vector Machine (SVM) is a powerful machine learning algorithm used for linear or nonlinear classification, regression, and even outlier detection tasks. SVMs can be used for a variety of tasks, such as text classification, image classification, spam detection, handwriting identification, gene expression analysis, face detection, and anomaly detection. They are adaptable and efficient across applications because they can manage high-dimensional data and nonlinear relationships. SVMs are very effective because they seek the maximum-margin hyperplane separating the classes in the target feature.
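A minimal classification sketch using scikit-learn's SVC on invented toy data; the support vectors mentioned below are available after fitting.

```python
# Minimal SVM classification sketch using scikit-learn's SVC.
# The toy data is a placeholder; kernel choices are discussed below.
import numpy as np
from sklearn.svm import SVC

X = np.array([[0, 0], [1, 1], [2, 2], [8, 8], [9, 9], [10, 10]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="rbf").fit(X, y)
print(clf.predict([[1.5, 1.5]]))
print(clf.support_vectors_)  # only the support vectors need to be stored
```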

Advantages:

  • SVM can be used to solve both classification and regression problems.
  • SVM can handle data like text, images, and trees effectively.
  • SVM only needs to store a subset of the training data (support vectors), making it memory efficient.

Disadvantages:

  • SVMs process the dataset as a whole, so the entire training set must fit in memory (RAM) during training.
  • SVM models can be less interpretable compared to other machine learning models, such as decision trees.

Linear Kernel

A linear kernel is the simplest type of kernel function used in machine learning, particularly in the context of support vector machines (SVMs) and kernel methods. Unlike non-linear kernels, it performs no implicit mapping to a higher-dimensional space: it simply computes the dot product of the input vectors, so it is appropriate when the data is already (approximately) linearly separable.
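A minimal sketch of the kernel itself, which reduces to an ordinary dot product (equivalent to scikit-learn's SVC(kernel="linear")):

```python
# A linear kernel is just the dot product of the input vectors.
import numpy as np

def linear_kernel(x1: np.ndarray, x2: np.ndarray) -> float:
    """K(x1, x2) = x1 . x2"""
    return float(np.dot(x1, x2))

print(linear_kernel(np.array([1.0, 2.0]), np.array([3.0, 4.0])))  # 11.0
```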

Advantages:

  • When the relationship between the independent and dependent variables is known to be linear, the linear kernel is the best choice due to its lower complexity compared to other kernels.
  • Over-fitting can be avoided in Linear Kernel methods using techniques like dimensionality reduction, regularization, and cross-validation.

Disadvantages:

  • Linear Kernel methods assume a linear relationship between dependent and independent variables, which may not always be accurate.
  • Linear Kernel methods might not perform well with non-linear datasets, as they are designed for linearly separable data.

Polynomial Kernel

In machine learning, the polynomial kernel is a kernel function commonly used with support vector machines (SVMs) and other kernelized models, that represents the similarity of vectors (training samples) in a feature space over polynomials of the original variables, allowing learning of non-linear models.
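A minimal sketch of the kernel function; the parameter names (degree, gamma, coef0) mirror the options of scikit-learn's SVC(kernel="poly"), and the defaults here are illustrative.

```python
# Polynomial kernel sketch: K(x, x') = (gamma * <x, x'> + coef0) ** degree.
import numpy as np

def polynomial_kernel(x1, x2, degree=3, gamma=1.0, coef0=1.0):
    return (gamma * np.dot(x1, x2) + coef0) ** degree

# (1*11 + 1) ** 3 = 1728.0
print(polynomial_kernel(np.array([1.0, 2.0]), np.array([3.0, 4.0])))
```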

Advantages:

  • They allow learning of non-linear models by representing the similarity of vectors in a feature space over polynomials of the original variables.
  • Polynomial kernels can capture complex relationships between features, making them suitable for problems with non-linear decision boundaries.

Disadvantages:

  • Polynomial kernels may not perform well when the training data is not normalized, as they can be sensitive to the scale of the input features.
  • Choosing the appropriate degree of the polynomial can be challenging, as a higher degree may lead to overfitting, while a lower degree may not capture the complexity of the data.

Random Forest

Random Forest is an ensemble learning method that constructs a multitude of decision trees during training and outputs the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees. It is one of the most versatile and widely used machine learning algorithms, known for its high accuracy, robustness, and ability to handle large datasets with high dimensionality. Random Forest can be applied to both classification and regression tasks across various domains.
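A minimal sketch using scikit-learn's RandomForestClassifier on invented toy data:

```python
# Minimal Random Forest sketch using scikit-learn; toy data is a placeholder.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.array([[1, 2], [2, 1], [8, 9], [9, 8], [1, 1], [9, 9]])
y = np.array([0, 0, 1, 1, 0, 1])

# n_estimators controls how many decision trees are built and combined.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(rf.predict([[2, 2]]))  # majority vote (mode) across the trees
```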

Random Forest Visualization

Advantages:

  • Highly accurate and robust due to the ensemble of decision trees.
  • Handles both categorical and numerical data.
  • Can maintain accuracy even when a large proportion of the data is missing.

Disadvantages:

  • Complexity and computational overhead due to the construction of multiple decision trees.
  • May overfit noisy data if not properly tuned.
  • Less interpretable compared to simpler models like decision trees.

RBF Kernel

The Radial Basis Function (RBF) kernel is a popular kernel function used in Support Vector Machines (SVMs) for classification and regression tasks. It is a non-linear kernel that transforms input data into a higher-dimensional space, where it becomes linearly separable. The RBF kernel computes the similarity between two data points based on the Euclidean distance between them, using a Gaussian function. Mathematically, the RBF kernel can be expressed as

\[ K(x, x') = \exp(-\gamma \, \lVert x - x' \rVert^2) \]

where \( \gamma \) is a hyperparameter that controls the kernel's smoothness and \( \lVert x - x' \rVert^2 \) is the squared Euclidean distance between the input vectors \( x \) and \( x' \).
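A minimal sketch of this formula as a NumPy function, with an illustrative value for \( \gamma \):

```python
# RBF kernel sketch matching the formula above: exp(-gamma * ||x - x'||^2).
import numpy as np

def rbf_kernel(x1, x2, gamma=0.5):
    return np.exp(-gamma * np.sum((x1 - x2) ** 2))

# Squared distance is 2, so this prints exp(-1.0).
print(rbf_kernel(np.array([1.0, 2.0]), np.array([2.0, 3.0])))
```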

Advantages:

  • Non-linearity for complex data.
  • Versatility across various applications.
  • Captures local patterns effectively.

Disadvantages:

  • Hyperparameter tuning complexity.
  • Prone to overfitting.
  • High computational cost for large datasets.