# Easily Implement Multiclass SVM From Scratch in Python | by Essam Wisam | Nov, 2023

Category:

Harness the Potential of AI Tools with ChatGPT. Our blog offers comprehensive insights into the world of AI technology, showcasing the latest advancements and practical applications facilitated by ChatGPT’s intelligent capabilities.

## Comes with a Free Deep Overview on SVMs  In this story, we shall implement the support vector machine learning algorithm in its general soft-margin and kernelized form. We will start by providing a brief overview of SVM and its training and inference equations, then subsequently translate these into code to develop the SVM model. Afterwards, we extend our implementation to handle multiclass scenarios and conclude by testing our model using Sci-kit Learn.

Thus, by the end of this story:

• You will gain a clear perspective of various important SVM concepts.
• You will be able to implement, with genuine comprehension, the SVM model from scratch for the binary and multiclass cases. Van Gogh Starry Night with a Line of Symmetry and Stars— Generated by author using DALLE 2

## Hard Margin SVM

The goal in SVM is to fit the hyperplane that would attain the maximum margin (distance from the closest points from in the two classes). It can be shown, and is intuitive that such hyperplane (A) has better generalization properties and is more robust to noise than a hyperplane that doesn’t maximize the margin (B).

In order to achieve this, SVM finds the hyperplane’s W and b by solving the following optimization problem:

It attempts to find W,b that maximizes the distance to the closest point and classifies everything correctly (as in the constraint where y takes ±1). This can be shown to be equivalent to the following optimization problem:

For which one can write the equivalent dual optimization problem

The solution to this yields a Lagrange multiplier for each point in the dataset which we assume to have size m: (α, α₂, …, α_N). The objective function is clearly quadratic in α and the constraints are linear which means that this can be easily solved with quadratic programming. Once the solution found, it follows from the derivation of the dual that:

Notice that only points with α>0 define the hyperplane (contribute to the sum). Those are called support vectors.

And thereby, the prediction equation, that when given a new example x, returns its prediction y=±1 is:

This basic form of SVM is called hard margin SVM because the optimization problem it solves (as defined above) enforces that all points in training must be classified correctly. In practical scenarios, there may be some noise that prevents or limits the existence of a hyperplane that perfectly separates the data in which case the optimization problem would return no or a poor solution.

## Soft Margin SVM Fit by Soft Margin SVM by Mangat et al on Research Gate. CC BY-SA 4.0 International

To generalize hard margin SVM, soft margin SVM adapts the optimization problem by introducing a C constant (user given hyperparamer) that controls how “hard” it should be. In particular, it modifies the primal optimization problem to become the following:

which allows each point to make some violation ϵₙ (e.g., be in the wrong side of the hyperplane) but still aims to reduce them overall by weighting their sum in the objective function by C. It becomes equivalent to hard margin as C approaches infinity (generally before it does). Meanwhile, a smaller C would allow more violations (in return for a wider margin; i.e., smaller wᵗw).

Quite surprisingly, it can be shown that the equivalent dual problem only changes by constraining α for each point to be ≤C.

Since violations are allowed, support vectors (points with α>0) are no longer all on the margin’s edge. It can be shown that any support vector that has committed a violation will have α=C and that non-support vectors (α=0) cannot commit violations. We call support vectors that potentially committed violations (α=C) “non-margin support vectors” and other pure ones (that have not committed violations; lie on the edge) “margin support vectors” (0<α<C).

It can be shown that the inference equation doesn’t change:

However, now (xₛ,yₛ) must now be a support vector that has not committed violations because the equation assumes it’s on the margin’s edge (previously, any support vector could be used).

## Kernel Soft Margin SVM

Soft Margin SVM extends the Hard Margin SVM to handle noise, but frequently, data is not separable by a hyperplane due to factors beyond noise, such as being naturally nonlinear. Soft Margin SVM can be used in cases like these, but then the optimal solution will probably involve a hyperplane that permits much more errors than is tolerable in reality.

Kernel Soft Margin SVM generalizes Soft Margin SVM to deal with situations where the data is naturally nonlinear. For instance, in the example shown on the left there is no linear hyperplane that Soft Margin SVM can find, regardless to the setting of C, that would decently separate the data.

It is possible, however, to map each point x in the dataset to a higher dimension via some transform function z=Φ(x) to make the data more linear (or perfectly linear) in the new higher dimensional space. This is equivalent to replacing x with z in the dual to get:

In reality, especially when Φ transforms into a very high-dimensional space, computing z can take a very long time. This is solved by the kernel trick which replaces the zz with an equivalent computation of a mathematical function (called kernel function) and which is much faster (e.g., an algebraic simplification of zz). For instance, here are some popular kernel functions (each of which corresponds to some transformation Φ to a higher dimensional space): Degree of polynomial (Q) and RBF γ are hyperparameters (decided by the user)

With this, the dual optimization problem becomes:

and intuitively, the inference equation becomes (after algebraic manipulation):

A full derivation of all the equations above assuming that you have the mathematical background can be found here.

For the implementation we will use

## Basic Imports

We start by importing some basic libraries:

`import numpy as np                  # for basic operations over arraysfrom scipy.spatial import distance  # to compute the Gaussian kernelimport cvxopt                       # to solve the dual opt. problemimport copy                         # to copy numpy arrays `

## Defining Kernels and SVM Hyperparameters

We start by defining the three kernels using their respective functions Degree of polynomial (Q) and RBF γ are hyperparameters (decided by the user)
`class SVM:linear = lambda x, xࠤ , c=0: x @ xࠤ.Tpolynomial = lambda x, xࠤ , Q=5: (1 + x @ xࠤ.T)**Qrbf = lambda x, xࠤ, γ=10: np.exp(-γ*distance.cdist(x, xࠤ,'sqeuclidean'))kernel_funs = {'linear': linear, 'polynomial': polynomial, 'rbf': rbf}`

For consistency with other kernels, the linear kernel takes an extra useless hyperparameter. As obvious, `kernel_funs` takes a string for the kernel and returns the corresponding kernel function.

Now let’s carry on by defining the constructor:

`class SVM:linear = lambda x, xࠤ , c=0: x @ xࠤ.Tpolynomial = lambda x, xࠤ , Q=5: (1 + x @ xࠤ.T)**Qrbf = lambda x, xࠤ, γ=10: np.exp(-γ*distance.cdist(x, xࠤ,'sqeuclidean'))kernel_funs = {'linear': linear, 'polynomial': polynomial, 'rbf': rbf}def __init__(self, kernel='rbf', C=1, k=2):# set the hyperparametersself.kernel_str = kernelself.kernel = SVM.kernel_funs[kernel]self.C = C                  # regularization parameterself.k = k                  # kernel parameter# training data and support vectors (set later)self.X, y = None, Noneself.αs = None# for multi-class classification (set later)self.multiclass = Falseself.clfs = []                          `

The SVM has three main hyperparameters, the kernel (we store the string given and the corresponding kernel function), the regularization parameter C and the kernel hyperparameter (to be passed to the kernel function); it represents Q for the polynomial kernel and γ for the RBF kernel.

## Define the Fit Method

To extend this class with `fit` and `predict` functions in separate cells, we will define the following function and use it as a decorator later:

`SVMClass = lambda func: setattr(SVM, func.__name__, func) or func`

Recall that fitting the SVM corresponds to finding the support vector α for each point by solving the dual optimization problem:

Let α be a variable column vector α₂ … α_N)ᵗ and let y be a constant column vector for the labels (y y₂ … y_N)ᵗ and let K be a constant matrix where K[n,m] computes the kernel at (xₙ, xₘ). Recall the following index-based equivalences for the dot product, outer product and quadratic form respectively:

to be able to write the dual optimization problem in matrix form as follow:

Knowing that this is a quadratic program as we hinted earlier, we can read over Quadratic Programming in CVXOPT’s documentation:

The square brackets tell you that you can call this with (P,q) only or (P,q,G,h) or (P, q, G, h, A, b) and so on (anything not given will be set by a default value such as 1).

To know the values for (P, q, G, h, A, b) in our case, we make the following comparison:

Let’s the comparison easier by rewriting the first as follows: Notice that we changed the max to min by multiplying the function by -1

It’s now obvious that (note that 0≤α is equivalent to -α≤0):

By this, we can write the following fit function:

`@SVMClassdef fit(self, X, y, eval_train=False):# if more than two unique labels, call the multiclass versionif len(np.unique(y)) > 2:self.multiclass = Truereturn self.multi_fit(X, y, eval_train)# if labels given in {0,1} change it to {-1,1}if set(np.unique(y)) == {0, 1}: y[y == 0] = -1# ensure y is a Nx1 column vector (needed by CVXOPT)self.y = y.reshape(-1, 1).astype(np.double) # Has to be a column vectorself.X = XN = X.shape  # Number of points# compute the kernel over all possible pairs of (x, x') in the data# by Numpy's vectorization this yields the matrix Kself.K = self.kernel(X, X, self.k)### Set up optimization parameters# For 1/2 x^T P x + q^T xP = cvxopt.matrix(self.y @ self.y.T * self.K)q = cvxopt.matrix(-np.ones((N, 1)))# For Ax = bA = cvxopt.matrix(self.y.T)b = cvxopt.matrix(np.zeros(1))# For Gx <= hG = cvxopt.matrix(np.vstack((-np.identity(N),np.identity(N))))h = cvxopt.matrix(np.vstack((np.zeros((N,1)),np.ones((N,1)) * self.C)))# Solve    cvxopt.solvers.options['show_progress'] = Falsesol = cvxopt.solvers.qp(P, q, G, h, A, b)self.αs = np.array(sol["x"])            # our solution# a Boolean array that flags points which are support vectorsself.is_sv = ((self.αs-1e-3 > 0)&(self.αs <= self.C)).squeeze()# an index of some margin support vectorself.margin_sv = np.argmax((0 < self.αs-1e-3)&(self.αs < self.C-1e-3))if eval_train:  print(f"Finished training with accuracy{self.evaluate(X, y)}")`

We ensure that this is a binary problem and that the binary labels are set as assumed by SVM (±1) and that y is a column vector with dimensions (N,1). Then we solve the optimization problem to find α₂ … α_N)ᵗ.

We use α₂ … α_N)ᵗ to get an array of flags that is 1 at any index corresponding to a support vector so that we can later apply the prediction equation by only summing over support vectors and an index for a margin support vector for (xₛ,yₛ). Notice that in the checks we do assume that non-support vectors may not have α=0 exactly, if it’s α≤1e-3 then this is approximately zero (we know CVXOPT results may not be ultimately precise). Likewise, we assume that non-margin support vectors may not have α=C exactly.

## Define the Predict Method

Recall the prediction equation is:

`@SVMClassdef predict(self, X_t):if self.multiclass: return self.multi_predict(X_t)# compute (xₛ, yₛ)xₛ, yₛ = self.X[self.margin_sv, np.newaxis], self.y[self.margin_sv]# find support vectorsαs, y, X= self.αs[self.is_sv], self.y[self.is_sv], self.X[self.is_sv]# compute the second termb = yₛ - np.sum(αs * y * self.kernel(X, xₛ, self.k), axis=0)# compute the scorescore = np.sum(αs * y * self.kernel(X, X_t, self.k), axis=0) + breturn np.sign(score).astype(int), score`

That’s it. We can also implement an `evaluate` method to compute the accuracy (used in fit above).

`@SVMClassdef evaluate(self, X,y):  outputs, _ = self.predict(X)accuracy = np.sum(outputs == y) / len(y)return round(accuracy, 2)`

## Test the Implementation

`from sklearn.datasets import make_classificationimport numpy as np# Load the datasetnp.random.seed(1)X, y = make_classification(n_samples=2500, n_features=5, n_redundant=0, n_informative=5, n_classes=2,  class_sep=0.3)# Test Implemented SVMsvm = SVM(kernel='rbf', k=1)svm.fit(X, y, eval_train=True)y_pred, _ = svm.predict(X)print(f"Accuracy: {np.sum(y==y_pred)/y.shape}")  #0.9108# Test with Scikitfrom sklearn.svm import SVCclf = SVC(kernel='rbf', C=1, gamma=1)clf.fit(X, y)y_pred = clf.predict(X)print(f"Accuracy: {sum(y==y_pred)/y.shape}")    #0.9108`

You can change the dataset and hyperparameter to further ensure that they are the same. Ideally, do so by comparing decision functions rather than accuracy.

## Generalizing Fit to Multiclass

`@SVMClassdef multi_fit(self, X, y, eval_train=False):self.k = len(np.unique(y))      # number of classes# for each pair of classesfor i in range(self.k):# get the data for the pairXs, Ys = X, copy.copy(y)# change the labels to -1 and 1Ys[Ys!=i], Ys[Ys==i] = -1, +1# fit the classifierclf = SVM(kernel=self.kernel_str, C=self.C, k=self.k)clf.fit(Xs, Ys)# save the classifierself.clfs.append(clf)if eval_train:  print(f"Finished training with accuracy {self.evaluate(X, y)}")`

To generalize the model to multiclass, over k classes. We train a binary SVM classifier for each class present where we loop on each class and relabel points belonging to it into +1 and points from all other classes into -1.

The result from training is k classifiers when given k classes where the ith classifier was trained on the data with the ith class being labeled as +1 and all others being labeled as -1.

## Generalizing Predict to Multiclass

Then to perform prediction on a new example, we choose the class for which the corresponding classifier is most confident (has the highest score)

`@SVMClassdef multi_predict(self, X):# get the predictions from all classifiersN = X.shapepreds = np.zeros((N, self.k))for i, clf in enumerate(self.clfs):_, preds[:, i] = clf.predict(X)# get the argmax and the corresponding scorereturn np.argmax(preds, axis=1), np.max(preds, axis=1)`

## Testing the Implementation

`from sklearn.datasets import make_classificationimport numpy as np# Load the datasetnp.random.seed(1)X, y = make_classification(n_samples=500, n_features=2, n_redundant=0, n_informative=2, n_classes=4, n_clusters_per_class=1,  class_sep=0.3)# Test SVMsvm = SVM(kernel='rbf', k=4)svm.fit(X, y, eval_train=True)y_pred = svm.predict(X)print(f"Accuracy: {np.sum(y==y_pred)/y.shape}") # 0.65# Test with Scikitfrom sklearn.multiclass import OneVsRestClassifierfrom sklearn.svm import SVCclf = OneVsRestClassifier(SVC(kernel='rbf', C=1, gamma=4)).fit(X, y)y_pred = clf.predict(X)print(f"Accuracy: {sum(y==y_pred)/y.shape}")    # 0.65`

Plotting the decision regions for each further leads to the following plot:

Beware that, although Sci-kit Learn’s SVM supports OVR by default (without an explicit call as shown above), that potentially also has further optimizations specific to SVM.

## Full Code

`import numpy as np                  # for basic operations over arraysfrom scipy.spatial import distance  # to compute the Gaussian kernelimport cvxopt                       # to solve the dual optimization problemimport copy                         # to copy numpy arrays class SVM:linear = lambda x, xࠤ , c=0: x @ xࠤ .Tpolynomial = lambda x, xࠤ , Q=5: (1 + x @ xࠤ.T)**Qrbf = lambda x, xࠤ , γ=10: np.exp(-γ * distance.cdist(x, xࠤ,'sqeuclidean'))kernel_funs = {'linear': linear, 'polynomial': polynomial, 'rbf': rbf}def __init__(self, kernel='rbf', C=1, k=2):# set the hyperparametersself.kernel_str = kernelself.kernel = SVM.kernel_funs[kernel]self.C = C                  # regularization parameterself.k = k                  # kernel parameter# training data and support vectorsself.X, y = None, Noneself.αs = None# for multi-class classificationself.multiclass = Falseself.clfs = []                                  # This is useless here (only for notebook)SVMClass = lambda func: setattr(SVM, func.__name__, func) or func@SVMClassdef fit(self, X, y, eval_train=False):if len(np.unique(y)) > 2:self.multiclass = Truereturn self.multi_fit(X, y, eval_train)# relabel if neededif set(np.unique(y)) == {0, 1}: y[y == 0] = -1# ensure y has dimensions Nx1self.y = y.reshape(-1, 1).astype(np.double) # Has to be a column vectorself.X = XN = X.shape# compute the kernel over all possible pairs of (x, x') in the dataself.K = self.kernel(X, X, self.k)# For 1/2 x^T P x + q^T xP = cvxopt.matrix(self.y @ self.y.T * self.K)q = cvxopt.matrix(-np.ones((N, 1)))# For Ax = bA = cvxopt.matrix(self.y.T)b = cvxopt.matrix(np.zeros(1))# For Gx <= hG = cvxopt.matrix(np.vstack((-np.identity(N),np.identity(N))))h = cvxopt.matrix(np.vstack((np.zeros((N,1)),np.ones((N,1)) * self.C)))# Solve    cvxopt.solvers.options['show_progress'] = Falsesol = cvxopt.solvers.qp(P, q, G, h, A, b)self.αs = np.array(sol["x"])# Maps into support vectorsself.is_sv = ((self.αs > 1e-3) & (self.αs <= self.C)).squeeze()self.margin_sv = np.argmax((1e-3 < self.αs) & (self.αs < self.C - 1e-3))if eval_train:  print(f"Finished training with accuracy {self.evaluate(X, y)}")@SVMClassdef multi_fit(self, X, y, eval_train=False):self.k = len(np.unique(y))      # number of classes# for each pair of classesfor i in range(self.k):# get the data for the pairXs, Ys = X, copy.copy(y)# change the labels to -1 and 1Ys[Ys!=i], Ys[Ys==i] = -1, +1# fit the classifierclf = SVM(kernel=self.kernel_str, C=self.C, k=self.k)clf.fit(Xs, Ys)# save the classifierself.clfs.append(clf)if eval_train:  print(f"Finished training with accuracy {self.evaluate(X, y)}")@SVMClassdef predict(self, X_t):if self.multiclass: return self.multi_predict(X_t)xₛ, yₛ = self.X[self.margin_sv, np.newaxis], self.y[self.margin_sv]αs, y, X= self.αs[self.is_sv], self.y[self.is_sv], self.X[self.is_sv]b = yₛ - np.sum(αs * y * self.kernel(X, xₛ, self.k), axis=0)score = np.sum(αs * y * self.kernel(X, X_t, self.k), axis=0) + breturn np.sign(score).astype(int), score@SVMClassdef multi_predict(self, X):# get the predictions from all classifierspreds = np.zeros((X.shape, self.k))for i, clf in enumerate(self.clfs):_, preds[:, i] = clf.predict(X)# get the argmax and the corresponding scorereturn np.argmax(preds, axis=1)@SVMClassdef evaluate(self, X,y):  outputs, _ = self.predict(X)accuracy = np.sum(outputs == y) / len(y)return round(accuracy, 2)from sklearn.datasets import make_classificationimport numpy as np# Load the datasetnp.random.seed(1)X, y = make_classification(n_samples=500, n_features=2, n_redundant=0, n_informative=2, n_classes=4, n_clusters_per_class=1,  class_sep=0.3)# Test SVMsvm = SVM(kernel='rbf', k=4)svm.fit(X, y, eval_train=True)y_pred = svm.predict(X)print(f"Accuracy: {np.sum(y==y_pred)/y.shape}")# Test with Scikitfrom sklearn.multiclass import OneVsRestClassifierfrom sklearn.svm import SVCclf = OneVsRestClassifier(SVC(kernel='rbf', C=1, gamma=4)).fit(X, y)y_pred = clf.predict(X)print(f"Accuracy: {sum(y==y_pred)/y.shape}")`

In summary, we implemented the support vector machine (SVM) learning algorithm, covering its general soft-margin and kernelized form. We provided an overview of SVM, developed the model in code, extended it for multiclass scenarios, and validated our implementation using Sci-kit Learn.

Hope you find what you have learnt from this story useful for your work. Till next time, au revoir.

Resources:

The code is mostly adaptations over the one present here (MIT License):