Support Vector Machine

The main focus of this course is deep learning. In this chapter, however, we take a brief detour to discuss the core idea behind a related machine learning method, the Support Vector Machine (SVM). In particular, we highlight the geometric foundations of SVMs, which provide a useful perspective for comparing the learning methodologies of SVMs and artificial neural networks.

We demonstrate the SVM method for linearly separable datasets in the first section (Linearly Separable Dataset), where we derive the primal optimization problem whose solution gives the weight vector of the optimal separating hyperplane. We also derive the dual optimization problem and discuss its advantages over the primal problem. As in the previous chapter, we then go beyond linearly separable datasets in the last section (Nonlinearly Separable Datasets), where we discuss the kernel trick for constructing an SVM for a dataset that is not linearly separable.

Note:

Refer to Chapter 12 (page 370) in the following book:

Marc Peter Deisenroth, A. Aldo Faisal, Cheng Soon Ong, Mathematics for Machine Learning, 2020.


Linearly Separable Dataset

In the preceding chapter, we studied the perceptron, which is a basic model in deep learning architectures. Recall that the perceptron algorithm updates its weight vector based on misclassified examples, and that the algorithm converges if and only if the dataset is linearly separable. However, when multiple separating hyperplanes exist, the perceptron may converge to any one of them (see Figure-(a) below) without regard to its generalization ability.

On the other hand, SVMs are classification methods in the broader field of machine learning that also seek a linear decision boundary (possibly in a transformed feature space) but are based on a different learning principle from the perceptron. Rather than updating the weight vector based on individual misclassifications, an SVM selects the one that maximizes the margin between the two classes among all possible separating hyperplanes (see Figure-(b) below). This margin-based criterion leads to an optimal separating hyperplane, called the SVM solution, with a relatively strong generalization ability.

In this section, we briefly discuss the learning procedure of SVMs for linearly separable datasets, while postponing the treatment of the nonlinearly separable case to the next section.

Hard and Soft Margins

Let us start with the definition of the margin between two classes of examples in a given linearly separable dataset.

An illustration of a linearly separable dataset, where red dots represent examples with label \(-1\) and green dots represent examples with label \(+1\). (a) Each dotted line represents a separating line. (b) The solid line represents the optimal separating line (the SVM solution), and the dotted lines indicate the lower and upper margins.

Definition:

[Margin]

Consider a linearly separable dataset

\[ \mathcal{D}=\{(\boldsymbol{x}_k, y_k)~|~k=1,2,\ldots, N\}\subset \mathbb{R}^n\times \{-1,1\}. \]

Let the decision boundary be defined by the hyperplane

\[ H = \{\boldsymbol{x}\in \mathbb{R}^n~|~\boldsymbol{w}\cdot \boldsymbol{x} = b\} \]

for a given weight vector \(\boldsymbol{w}\in \mathbb{R}^n\) and bias \(b\in \mathbb{R}\). The geometric margin (or simply margin) \(\rho\) of the hyperplane \(H\) with respect to the dataset \(\mathcal{D}\) is defined as

\[ \rho = \min_{k\in \{1,2,\ldots, N\}} \frac{|\boldsymbol{w}\cdot \boldsymbol{x}_k - b|}{\|\boldsymbol{w}\|_2}. \]

Clearly the margin of a given hyperplane is the distance between the hyperplane and the nearest example in \(\mathcal{D}\). The SVM solution is the separating hyperplane that maximizes the margin as illustrated in the above Figure-(b).

A separating hyperplane (also referred to as decision boundary) classifies all the examples of \(\mathcal{D}\) correctly. This is equivalent to the condition that

\[ \begin{array}{ccc} \boldsymbol{w}\cdot \boldsymbol{x}_k - b \ge 0 &\implies& y_k = +1\\ \boldsymbol{w}\cdot \boldsymbol{x}_k - b < 0 &\implies& y_k = -1. \end{array} \]

The above conditions can be combined to obtain the following lemma.

Lemma:

A hyperplane

\[ H = \{\boldsymbol{x}\in \mathbb{R}^n~|~ \boldsymbol{w}\cdot \boldsymbol{x} = b\}, \]

is a separating hyperplane of a dataset \(\mathcal{D}\) if and only if

\begin{eqnarray} y_k \big(\boldsymbol{w}\cdot \boldsymbol{x}_k - b\big) \ge 0,~\text{for all}~\boldsymbol{x}_k\in \mathcal{D}. \end{eqnarray}

(3.1)
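The margin definition and condition (3.1) can be checked numerically. The following is a minimal sketch in plain Python; the helper names `dot`, `is_separating`, and `geometric_margin`, as well as the toy dataset, are assumptions made for illustration and do not come from the text above.

```python
import math

def dot(u, v):
    """Euclidean inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def is_separating(w, b, X, y):
    """Condition (3.1): y_k (w . x_k - b) >= 0 for every example."""
    return all(yk * (dot(w, xk) - b) >= 0 for xk, yk in zip(X, y))

def geometric_margin(w, b, X):
    """Distance from the hyperplane w . x = b to the nearest point of X."""
    norm_w = math.sqrt(dot(w, w))
    return min(abs(dot(w, xk) - b) / norm_w for xk in X)

# Toy dataset in R^2 (an assumption made for illustration).
X = [(2.0, 0.0), (3.0, 1.0), (0.0, 0.0), (-1.0, 2.0)]
y = [1, 1, -1, -1]

# Two candidate separating lines: x_1 = 1 and x_1 = 1.5.
for w, b in [((1.0, 0.0), 1.0), ((1.0, 0.0), 1.5)]:
    print(is_separating(w, b, X, y), geometric_margin(w, b, X))
```

Both lines separate this dataset, but the first one has the larger margin, which is the quantity the SVM seeks to maximize.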

Note:

Observe the difference between the label values used in the perceptron and those used in the SVMs. In perceptrons, we used the binary label set \(\{0,1\}\), whereas in the SVM framework, we use the bipolar label set \(\{-1,1\}\). As already noted, we could have also used the bipolar activation function

\[ H(x) = \left\{\begin{array}{rc} -1,&\text{if}~x<0\\ 1,&\text{if}~x\ge 0 \end{array}\right. \]

The difference between using either the binary or the bipolar labels is precisely in the condition given in the above lemma and in the update rule of the perceptron given in the previous chapter. If binary labels are used, then the condition in the above lemma takes the form \(\Delta_k \big(\boldsymbol{w}\cdot \boldsymbol{x}_k - b\big) \ge 0\), where \(\Delta_k\) is defined in the multiple-epoch perceptron algorithm. On the other hand, if bipolar labels are used along with the bipolar activation function in the perceptron learning, then the update rule involves \(y_k\) directly, in place of \(\Delta_k\).

The following lemma is a consequence of the definition of the margin and the condition given in the above lemma.

Lemma:

Let \((b, \boldsymbol{w})\) be such that the hyperplane

\[ H=\{\boldsymbol{x}\in \mathbb{R}^n~|~ \boldsymbol{w}\cdot \boldsymbol{x} = b\} \]

is a separating hyperplane of a dataset

\[ \mathcal{D}=\{(\boldsymbol{x}_k, y_k)~|~k=1,2,\ldots, N\}\subset \mathbb{R}^n\times \{-1,1\}. \]

If \(\rho\) is the margin of \(H\) for the dataset \(\mathcal{D}\), then

\[ y_k \big(\boldsymbol{w}\cdot \boldsymbol{x}_k - b\big) \ge \rho\, \|\boldsymbol{w}\|_2,~~k=1,2,\ldots, N. \]

In view of the above lemma, the SVM solution of a linearly separable dataset can be obtained as the optimal separating hyperplane of the dataset.

Definition:

[Optimal Separating Hyperplane]

For a given linearly separable dataset \(\mathcal{D}\), the optimal separating hyperplane is defined as the hyperplane

\[ H^* = \{\boldsymbol{x}\in \mathbb{R}^n~|~\boldsymbol{w}^*\cdot\boldsymbol{x}=b^*\}, \]

where \((b^*, \boldsymbol{w}^*)\) is a maximizer of the constrained optimization problem

\begin{eqnarray} \left.\begin{array}{ll} &\displaystyle{\max_{(b, \boldsymbol{w})}}~ \rho,\\ \text{subject to:}& \left\{\begin{array}{l} y_k (\boldsymbol{w}\cdot \boldsymbol{x}_k - b)\ge \rho,~\text{for all}~ (\boldsymbol{x}_k, y_k)\in \mathcal{D},\\ \|\boldsymbol{w}\|=1,~ \end{array}\right. \end{array}\right\} \end{eqnarray}

(3.2)

where \(\rho\) is the margin defined above.

The optimization problem (3.2) can be expressed equivalently as a convex optimization problem in the weight vector \(\boldsymbol{w}\), which is strictly convex and therefore admits a unique global solution for \(\boldsymbol{w}.\)

Theorem:

[Hard Margin SVM]

Consider the convex optimization problem

\begin{eqnarray} \left.\begin{array}{ll} & \displaystyle{\min_{(b, \boldsymbol{w})}}~ \frac{1}{2}\|\boldsymbol{w}\|_2^2,\\ \text{subject to:}& y_k \big( \boldsymbol{w}\cdot\boldsymbol{x}_k - b \big)\ge 1,~\text{for all}~ (\boldsymbol{x}_k, y_k)\in \mathcal{D}, \end{array}\right\} \end{eqnarray}

(3.3)

where

\[ \mathcal{D}=\{(\boldsymbol{x}_k, y_k)~|~k=1,2,\ldots, N\}\subset \mathbb{R}^n\times \{-1,1\} \]

is a linearly separable dataset.

Let \((b^*, \boldsymbol{w}^*)\) be a maximizer of (3.2) with the corresponding margin \(\rho^*>0\), and let \((b_*,\boldsymbol{w}_*)\) be a minimizer of (3.3). Then the following statements hold:

There exists an \(\alpha\in \mathbb{R}\) such that \((\alpha b^*, \alpha\boldsymbol{w}^*)\) is a minimizer of (3.3).
There exists a \(\beta\in \mathbb{R}\) such that \((\beta b_*, \beta\boldsymbol{w}_*)\) is a maximizer of (3.2).

In other words, the constrained optimization problems (3.2) and (3.3) are equivalent. The minimizer of (3.3) is called the hard margin SVM of the given linearly separable dataset \(\mathcal{D}\).

Proof:

\(~\)

Let \(\tilde{\boldsymbol{w}}^* = \boldsymbol{w}^*/\rho^*\) and \(\tilde{b}^* = b^*/\rho^*\).
Since \((b^*,\boldsymbol{w}^*)\) is a maximizer of (3.2), we have \( y_k (\boldsymbol{w}^*\cdot \boldsymbol{x}_k - b^*)\ge \rho^*. \) Since \(\rho^*>0\), we can write the inequality as
\begin{eqnarray} y_k \left( \frac{\boldsymbol{w}^*}{\rho^*} \cdot \boldsymbol{x}_k - \frac{b^*}{\rho^*} \right) \ge 1. \end{eqnarray}
(3.4)
Therefore, \((\tilde{b}^*, \tilde{\boldsymbol{w}}^*)\) satisfies the constraint of the problem (3.3). Since \(\boldsymbol{w}^*\) is a unit vector, we have
\[ \|\tilde{\boldsymbol{w}}^*\| = \left\|\frac{\boldsymbol{w}^*}{\rho^*}\right\| = \frac{1}{\rho^*}. \]
Since maximizing \(\rho\) is equivalent to minimizing \(1/\rho\), and \(1/\rho^*\) is the norm of the rescaled weight vector, we see that
\begin{eqnarray} (\tilde{b}^*, \tilde{\boldsymbol{w}}^*) &=& \left\{ \begin{array}{ll} &\displaystyle{\text{argmin}_{(b, \boldsymbol{w})}} ~ \|\boldsymbol{w}\|\\ \text{subject to:}& y_k \big( \boldsymbol{w}\cdot\boldsymbol{x}_k - b \big)\ge 1,~\text{for all}~ (\boldsymbol{x}_k, y_k)\in \mathcal{D} \end{array}\right.\\ &=& \left\{ \begin{array}{ll} &\displaystyle{\text{argmin}_{(b, \boldsymbol{w})}} ~ \frac{1}{2}\|\boldsymbol{w}\|^2\\ \text{subject to:}& y_k \big( \boldsymbol{w}\cdot\boldsymbol{x}_k - b \big)\ge 1,~\text{for all}~ (\boldsymbol{x}_k, y_k)\in \mathcal{D}. \end{array}\right. \end{eqnarray}
(3.5)
Thus, we have proved the first statement of the theorem with \(\alpha = 1/\rho^*\).
Let us take \(\beta=1/ \|\boldsymbol{w}_*\|\) and define \(\hat{\boldsymbol{w}}_* = \frac{\boldsymbol{w}_*}{\|\boldsymbol{w}_*\|}\) and \(\hat{b}_* = \dfrac{b_*}{\|\boldsymbol{w}_*\|}\). The claim that \((\hat{b}_*, \hat{\boldsymbol{w}}_*)\) is a maximizer of (3.2) is left as an exercise.
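The rescaling step in the first part of the proof can be illustrated on a tiny example. The sketch below assumes a two-point dataset for which the maximizer of (3.2) can be found by inspection (the two points are 4 units apart, so no separating hyperplane can have margin larger than 2); the dataset and all variable names are illustrative assumptions.

```python
import math

def dot(u, v):
    """Euclidean inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

# Toy dataset (assumed): one point per class, 4 units apart along x_1.
X = [(4.0, 0.0), (0.0, 0.0)]
y = [1, -1]

# Maximizer of (3.2) for this dataset, found by inspection:
# unit normal vector, hyperplane x_1 = 2, margin 2.
w_star, b_star, rho_star = (1.0, 0.0), 2.0, 2.0

# Rescale by alpha = 1 / rho_star, as in the proof.
alpha = 1.0 / rho_star
w_tilde = tuple(alpha * wi for wi in w_star)
b_tilde = alpha * b_star

# The rescaled pair satisfies the constraints of (3.3) ...
constraints = [yk * (dot(w_tilde, xk) - b_tilde) for xk, yk in zip(X, y)]
# ... and its norm equals 1 / rho_star.
norm_w_tilde = math.sqrt(dot(w_tilde, w_tilde))
print(constraints, norm_w_tilde)
```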

Remark:

[Support Vectors]

Let \((b^*, \boldsymbol{w}^*)\) be the hard margin SVM. If an example \((\boldsymbol{x}_k, y_k)\in \mathcal{D}\) is such that

\[ \boldsymbol{w}^* \cdot \boldsymbol{x}_k - b^* = \pm 1, \]

then the vector \(\boldsymbol{x}_k\) is said to be a support vector. These are the vectors that are closest to the optimal separating hyperplane.
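As a small illustration of this remark, the sketch below lists the support vectors of a toy dataset. It assumes that \((b^*, \boldsymbol{w}^*) = (1, (1,0))\) is the hard margin SVM of this dataset, which can be verified by hand: the closest points of opposite classes, \((2,0)\) and \((0,0)\), are 2 units apart, so no separating hyperplane has margin larger than 1. The function name and the tolerance are illustrative choices.

```python
def dot(u, v):
    """Euclidean inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def support_vector_indices(w, b, X, tol=1e-9):
    """Indices k with |w . x_k - b| = 1, up to a numerical tolerance."""
    return [k for k, xk in enumerate(X) if abs(abs(dot(w, xk) - b) - 1.0) <= tol]

# Toy dataset and its (assumed) hard margin SVM.
X = [(2.0, 0.0), (3.0, 1.0), (0.0, 0.0), (-1.0, 2.0)]
y = [1, 1, -1, -1]
w_star, b_star = (1.0, 0.0), 1.0

print(support_vector_indices(w_star, b_star, X))  # [0, 2]
```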

Problem:

Suppose a hard-margin SVM in \(\mathbb{R}^2\) has exactly four support vectors: two vectors \(\boldsymbol{x}_1,\boldsymbol{x}_2\) from the class \(+1\) and two vectors \(\boldsymbol{x}_3,\boldsymbol{x}_4\) from the class \(-1\). Show that the line through \(\boldsymbol{x}_1\) and \(\boldsymbol{x}_2\) is parallel to the line through \(\boldsymbol{x}_3\) and \(\boldsymbol{x}_4\).

Does the result hold in \(\mathbb{R}^3\)?

Remark:

[Soft margin SVM]

There are broadly two reasons why a dataset may fail to be linearly separable. One is when the decision boundary is not a hyperplane. In such cases, we use a feature space transformation, as discussed in the previous chapter, where the basis functions may be handled implicitly using kernels, which we discuss later in this chapter.

The second scenario is when the dataset contains outliers or noise, making a perfect separation impossible. In such cases, one has to allow some classification errors while maximizing the margin. This leads to the concept of a soft margin SVM whose solution is obtained as the minimizer of the problem

\begin{eqnarray} \left.\begin{array}{ll} & \displaystyle{\min_{(b, \boldsymbol{w}), \boldsymbol{\xi}}}~ \left[ \frac{1}{2}\|\boldsymbol{w}\|_2^2 + C \sum_{k=1}^N \xi_k \right],\\ \text{subject to:}& \left\{\begin{array}{cl} y_k \big( \boldsymbol{w}\cdot\boldsymbol{x}_k - b \big)\ge 1-\xi_k,&\text{for all}~ (\boldsymbol{x}_k,y_k)\in \mathcal{D}, \\ \xi_k\ge 0,~k=1,2,\ldots, N. \end{array}\right. \end{array}\right\} \end{eqnarray}

(3.6)

Here \(C>0\) is a tuning parameter called the regularization parameter, and \(\boldsymbol{\xi} = (\xi_1,\xi_2,\ldots, \xi_N)\) represents the margin errors. Each variable \(\xi_k\), \(k=1,2,\ldots, N\), referred to as a slack variable, quantifies the extent to which the point \(\boldsymbol{x}_k\) violates the margin constraint. This allows some tolerance in classification including the possibility of lying on the wrong side of the decision boundary.

Note:

For a soft margin SVM, the support vectors are the points that lie exactly on the margin (as in the hard margin case; see the remark on support vectors above), those that lie inside the margin, and those that are misclassified.

Equivalently, the support vectors are precisely the training points \((\boldsymbol{x}_i, y_i)\) for which

\[ y_i \big( \boldsymbol{w} \cdot \boldsymbol{x}_i - b \big) = 1 - \xi_i. \]

Loss-function form

We now present the equivalence between the quadratic optimization problem (3.6) and an unconstrained optimization problem that involves a loss function based on the hinge loss.

We can eliminate the slack variables \(\boldsymbol{\xi}\) and write the constraints in (3.6) directly in terms of a hinge loss as

\[ \ell_{\text{hinge}}(h(\boldsymbol{x}), y) = \max\big(0,\, 1 - y h(\boldsymbol{x})\big), \]

where \(h(\boldsymbol{x}) = \langle \boldsymbol{w}, \boldsymbol{x} \rangle - b\) and \(y\in \{-1, 1\}\). Then the soft margin SVM objective (3.6) can be written as

\begin{eqnarray} \min_{(b, \boldsymbol{w})} \left[ \frac{1}{2} \| \boldsymbol{w} \|_2^2 \;+\; C \hat{R}^{\text{hinge}}(b, \boldsymbol{w}) \right], \end{eqnarray}

(3.7)

where

\begin{eqnarray} \hat{R}^{\text{hinge}}(b, \boldsymbol{w})\;=\; \frac{1}{N} \sum_{k=1}^N \ell_{\text{hinge}}\big(h(\boldsymbol{x}_k),\, y_k\big) \end{eqnarray}

(3.8)

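is the empirical risk based on the hinge loss. The first term in (3.7) is called the regularizer. Note that, because of the factor \(1/N\) in (3.8), the constant \(C\) in (3.7) equals \(N\) times the constant \(C\) in (3.6).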
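The hinge loss, the empirical risk (3.8), and the objective (3.7) are straightforward to evaluate for a given pair \((b, \boldsymbol{w})\). The following sketch does so on a toy dataset with one outlier; the data, the chosen hyperplane, and the function names are assumptions made for illustration. The hinge loss values computed here play the role of the slack variables \(\xi_k\).

```python
def dot(u, v):
    """Euclidean inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def hinge(w, b, x, y):
    """Hinge loss max(0, 1 - y h(x)) with h(x) = <w, x> - b."""
    return max(0.0, 1.0 - y * (dot(w, x) - b))

def empirical_hinge_risk(w, b, X, Y):
    """Average hinge loss over the dataset, as in (3.8)."""
    return sum(hinge(w, b, x, y) for x, y in zip(X, Y)) / len(X)

def soft_margin_objective(w, b, X, Y, C):
    """Objective (3.7): regularizer plus C times the empirical hinge risk."""
    return 0.5 * dot(w, w) + C * empirical_hinge_risk(w, b, X, Y)

# Toy data (assumed): the last point is a +1 example on the wrong side.
X = [(2.0, 0.0), (3.0, 1.0), (0.0, 0.0), (-1.0, 2.0), (0.5, 0.0)]
Y = [1, 1, -1, -1, 1]
w, b = (1.0, 0.0), 1.0

slacks = [hinge(w, b, x, y) for x, y in zip(X, Y)]
objective = soft_margin_objective(w, b, X, Y, C=1.0)
print(slacks)
print(objective)
```

Only the outlier has a nonzero loss (1.5), so the empirical risk is 1.5/5 and the objective is approximately 0.8 for \(C=1\).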

Problem:

Show that the primal soft margin SVM problem (3.6) is equivalent to the unconstrained optimization problem (3.7)-(3.8).

Subgradient Descent Algorithm

In this subsection, we outline an algorithm based on the subgradient descent method for the soft margin problem with hinge loss given by (3.7)-(3.8).

Let us define the cost function as

\[ \mathcal{C}(b,\boldsymbol{w}) := \frac{\lambda}{2}\|\boldsymbol{w}\|^2 \;+\; \frac{1}{N} \sum_{k=1}^N \ell_{\text{hinge}}(\boldsymbol{x}_k, y_k;b,\boldsymbol{w}), \]

where \(\lambda\) is the regularization parameter, and the hinge loss \(\ell_{\text{hinge}}\) is given by

\[ \ell_{\text{hinge}}(\boldsymbol{x}, y;b,\boldsymbol{w}) = \max\big(0, \, 1 - y (\langle \boldsymbol{w}, \boldsymbol{x} \rangle - b)\big). \]

The aim is to compute the minimizer of the cost function. That is, to find

\[ (b^*, \boldsymbol{w}^*) = \text{argmin}_{(b,\boldsymbol{\boldsymbol{w}})} \mathcal{C}(b,\boldsymbol{w}). \]

Since both the regularizer and the empirical risk are convex with respect to \((b,\boldsymbol{w})\), the cost function is convex. Moreover, since the regularizer is strictly convex in \(\boldsymbol{w}\), for \(\lambda>0\), the optimal weight vector \(\boldsymbol{w}^*\) is unique.

We use the gradient descent method to compute the global minimizer of the cost function. Since the hinge loss is not differentiable at points where \(y(\langle \boldsymbol{w}, \boldsymbol{x}\rangle - b) = 1\), we use subgradients in place of gradients, and the resulting version of the gradient descent method is called the subgradient descent method. A subgradient of the hinge loss function with respect to \(\boldsymbol{w}\) is given by

\[ \nabla_{\boldsymbol{\boldsymbol{w}}} \ell_{\text{hinge}}(\boldsymbol{x}, y;b,\boldsymbol{w}) = \left\{\begin{array}{cc} - y \boldsymbol{x}, & \text{if } y(\langle \boldsymbol{w}, \boldsymbol{x}\rangle - b) < 1, \\ 0, & \text{if }y(\langle \boldsymbol{w}, \boldsymbol{x}\rangle - b) \ge 1, \end{array}\right. \]

and the subgradient with respect to \(b\) is given by

\[ \frac{\partial }{\partial b} \ell_{\text{hinge}}(\boldsymbol{x}, y;b,\boldsymbol{w})= \begin{cases} y, & \text{if } y(\langle \boldsymbol{w}, \boldsymbol{x}\rangle - b) < 1, \\ 0, & \text{if }y(\langle \boldsymbol{w}, \boldsymbol{x}\rangle - b) \ge 1. \end{cases} \]

Therefore, subgradients of the cost function with respect to \(\boldsymbol{w}\) and \(b\) are given, respectively, by

\begin{eqnarray} \nabla_{\boldsymbol{\boldsymbol{w}}} \mathcal{C}(b,\boldsymbol{w}) &=& \lambda \boldsymbol{w} + \frac{1}{N} \sum_{k=1}^N \nabla_{\boldsymbol{\boldsymbol{w}}} \ell_{\text{hinge}}(\boldsymbol{x}_k, y_k;b,\boldsymbol{w}), \\ \frac{\partial }{\partial b} \mathcal{C}(b,\boldsymbol{w}) &=& \frac{1}{N} \sum_{k=1}^N \frac{\partial }{\partial b} \ell_{\text{hinge}}(\boldsymbol{x}_k, y_k;b,\boldsymbol{w}). \end{eqnarray}

(3.9)

The subgradient descent update rule for minimizing the primal soft margin SVM with hinge-loss objective is defined as

\begin{eqnarray} \boldsymbol{w}_{t+1} &=& \boldsymbol{w}_t -\eta \nabla_{\boldsymbol{\boldsymbol{w}}} \mathcal{C}(b_t,\boldsymbol{w}_t),\\ b_{t+1} &=& b_t - \eta\frac{\partial }{\partial b} \mathcal{C}(b_t,\boldsymbol{w}_t), \end{eqnarray}

(3.10)

where \(0<\eta<1\) is the learning rate.
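The update rule (3.10) with the subgradients (3.9) can be sketched as follows. This is a minimal full-batch illustration in plain Python, not a tuned implementation: the toy dataset, the values of \(\lambda\) and \(\eta\), the zero initialization, and the function names are all assumptions. Because a subgradient step need not decrease the cost, the sketch keeps track of the best iterate seen so far.

```python
def dot(u, v):
    """Euclidean inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def cost(w, b, X, Y, lam):
    """Cost: (lam / 2) ||w||^2 plus the average hinge loss."""
    hinge = sum(max(0.0, 1.0 - y * (dot(w, x) - b)) for x, y in zip(X, Y))
    return 0.5 * lam * dot(w, w) + hinge / len(X)

def subgradient(w, b, X, Y, lam):
    """One subgradient of the cost with respect to (w, b), following (3.9)."""
    n = len(X)
    gw = [lam * wi for wi in w]
    gb = 0.0
    for x, y in zip(X, Y):
        if y * (dot(w, x) - b) < 1.0:  # the hinge loss is active here
            gw = [g - y * xi / n for g, xi in zip(gw, x)]
            gb += y / n
    return gw, gb

def subgradient_descent(X, Y, lam=0.01, eta=0.1, steps=500):
    """Fixed-step updates (3.10); returns (best cost, w, b) over all iterates."""
    w, b = [0.0] * len(X[0]), 0.0
    best = (cost(w, b, X, Y, lam), w, b)
    for _ in range(steps):
        gw, gb = subgradient(w, b, X, Y, lam)
        w = [wi - eta * g for wi, g in zip(w, gw)]
        b = b - eta * gb
        c = cost(w, b, X, Y, lam)
        if c < best[0]:
            best = (c, w, b)
    return best

# Toy dataset (assumed for illustration).
X = [(2.0, 0.0), (3.0, 1.0), (0.0, 0.0), (-1.0, 2.0)]
Y = [1, 1, -1, -1]
best_cost, w, b = subgradient_descent(X, Y)
print(best_cost, w, b)
```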

Algorithm:

[Pegasos]

Input:

the training dataset \(\mathcal{D}_\text{train}=\{(\boldsymbol{x}_k,y_k)~|~ k=1,2,\ldots, N_\text{train}\}\);
the initial weight vector \(\boldsymbol{w}_0=(w_{0,1}, \ldots, w_{0,n})\in \mathbb{R}^{n}\) and the bias \(b_0\);
an integer \(0< m \le N_\text{train}\);
a sufficiently large positive integer \(T\); and
a regularization parameter \(\lambda \in (0,\infty)\).

Processing: [Subgradient Update Rule]

Step 1: For each \(t=1,2,\ldots, T\), select a set of \(m\) distinct training examples uniformly at random, and denote it as

\[ A_t := \Big\{(\boldsymbol{x}_{k_i}, y_{k_i})~|~\text{for each } i=1,2,\ldots, m, k_i\in \{1,2,\ldots, N_{\rm train}\} \Big\}. \]

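Step 2: Collect the set of points \(A_{t}^- \subseteq A_t\) that violate the margin condition for the hyperplane \((b_{t-1}, \boldsymbol{w}_{t-1})\), that is, the points \((\boldsymbol{x},y)\in A_t\) with \(y(\langle \boldsymbol{w}_{t-1}, \boldsymbol{x}\rangle - b_{t-1}) < 1\); these are exactly the points at which the subgradient of the hinge loss is nonzero. Let the cardinality be \(\#(A_{t}^-) = m_t.\)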

Step 3: Set \(\eta_t = \dfrac{1}{\lambda t}\).

Step 4: Perform the subgradient descent update as follows: If \(m_t>0\), then

\begin{eqnarray} \boldsymbol{w}_{t} &=& (1- \eta_t\lambda) \boldsymbol{w}_{t-1} + \frac{\eta_t}{m} \sum_{(\boldsymbol{x},y)\in A_{t}^-} y \boldsymbol{x}\\ b_{t} &=& b_{t-1} - \frac{\eta_t}{m} \sum_{(\boldsymbol{x},y)\in A_{t}^-}y. \end{eqnarray}

(3.11)

Else,

\begin{eqnarray} \boldsymbol{w}_{t} &=& (1- \eta_t\lambda) \boldsymbol{w}_{t-1} \\ b_{t} &=& b_{t-1}. \end{eqnarray}

(3.12)

Output: \((b_T, \boldsymbol{w}_T).\)
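The steps of the algorithm can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions rather than a reference implementation: it starts from \(\boldsymbol{w}_0=\boldsymbol{0}\) and \(b_0=0\), treats a point as a violator when \(y(\langle \boldsymbol{w}, \boldsymbol{x}\rangle - b) < 1\), and averages the subgradient over the mini-batch size \(m\), which is the average that appears in the mini-batch version of (3.9). The function name, the `seed` argument, and the toy dataset are hypothetical.

```python
import random

def dot(u, v):
    """Euclidean inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def pegasos(X, Y, lam, m, T, seed=0):
    """Mini-batch Pegasos with a bias term; returns (w_T, b_T)."""
    rng = random.Random(seed)
    n_train, dim = len(X), len(X[0])
    w, b = [0.0] * dim, 0.0
    for t in range(1, T + 1):
        # Step 1: draw m distinct indices uniformly at random.
        batch = rng.sample(range(n_train), m)
        # Step 2: keep the examples that violate the margin at (b, w).
        viol = [k for k in batch if Y[k] * (dot(w, X[k]) - b) < 1.0]
        # Step 3: step size.
        eta = 1.0 / (lam * t)
        # Step 4: shrink w, then add the contribution of the violators.
        w = [(1.0 - eta * lam) * wi for wi in w]
        for k in viol:
            w = [wi + eta / m * Y[k] * xi for wi, xi in zip(w, X[k])]
            b -= eta / m * Y[k]
    return w, b

# Toy dataset (assumed for illustration).
X = [(2.0, 0.0), (3.0, 1.0), (0.0, 0.0), (-1.0, 2.0)]
Y = [1, 1, -1, -1]
w, b = pegasos(X, Y, lam=0.1, m=2, T=1000)
print(w, b)
```

Choosing `m` equal to the number of training examples makes every batch the full dataset, so the run no longer depends on the random seed.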

Note:

Pegasos stands for Primal Estimated sub-GrAdient SOlver for SVM. The algorithm was proposed in

Shalev-Shwartz, S., Singer, Y., Srebro, N., and Cotter, A. Pegasos: primal estimated sub-gradient solver for SVM. Math. Program. 127, pp. 3-30 (2011).

DOI (https://doi.org/10.1007/s10107-010-0420-4)

This method is often referred to as a subgradient projection method. For \(m=N_{\text{train}}\), it reduces to the deterministic subgradient method, and for \(m=1\), it is called the stochastic subgradient method.

Dual Optimization Problem

The SVM problems (3.3) and (3.6) are referred to as the primal optimization problems. These are convex optimization problems with quadratic objective functions and linear constraints. From the primal problem one can derive an alternative formulation, referred to as the dual optimization problem, which is equivalent to the original problem.

A convex dual problem can be derived by forming the Lagrangian of the primal problem as

\begin{eqnarray} \mathcal{L}(b, \boldsymbol{w}, \boldsymbol{\xi}, \boldsymbol{\alpha}, \boldsymbol{\beta}) = \frac{1}{2} \|\boldsymbol{w}\|_2^2 + C \sum_{k=1}^N \xi_k - \left( \sum_{k=1}^N \alpha_k \left[ y_k(\boldsymbol{w} \cdot \boldsymbol{x}_k - b) - 1 + \xi_k \right] + \sum_{k=1}^N \beta_k \xi_k \right), \end{eqnarray}

(3.13)

where the first two terms correspond to the objective function in (3.6), and the expressions inside the brackets correspond to the inequality constraints enforced using the parameter vectors \(\boldsymbol{\alpha}\ge \boldsymbol{0}\) and \(\boldsymbol{\beta}\ge \boldsymbol{0}\), called the Lagrange multipliers.

The gradient of \(\mathcal{L}\) with respect to \(\boldsymbol{w}\) is given by

\begin{eqnarray} \nabla_{\!\!\boldsymbol{w}}\, \mathcal{L} = \boldsymbol{w} - \sum_{k=1}^N \alpha_k y_k \boldsymbol{x}_k. \end{eqnarray}

(3.14)

Also, differentiating \(\mathcal{L}\) with respect to the bias \(b\) and \(\xi_k\), for \(k=1,2,\ldots, N\), gives

\begin{eqnarray} \frac{\partial \mathcal{L}}{\partial b} =\sum_{k=1}^N \alpha_k y_k \end{eqnarray}

(3.15)

\begin{eqnarray} \frac{\partial \mathcal{L}}{\partial \xi_k} = C-\alpha_k - \beta_k. \end{eqnarray}

(3.16)

In order to obtain the extremum of the Lagrangian, we equate the above three expressions to zero. First, equating gradient vector in (3.14) to zero, we obtain

\begin{eqnarray} \boldsymbol{w} = \sum_{j=1}^N \alpha_j y_j \boldsymbol{x}_j, \end{eqnarray}

(3.17)

which shows that the optimal weight vector of the primal problem (3.6) can be obtained as a linear combination of the input vectors where the coefficients are such that

\begin{eqnarray} \displaystyle{\sum_{k=1}^N} \alpha_k y_k = 0 \end{eqnarray}

(3.18)

from stationarity of the Lagrangian with respect to \(b\). Note that only those vectors \(\boldsymbol{x}_k\) for which \(\alpha_k>0\), for \(k=1,2,\ldots, N\) contribute to the weight \(\boldsymbol{w}\) and they are called support vectors.

Substituting (3.17) and (3.18) into the Lagrangian (3.13), we get

\[ \mathcal{L}_{\tiny D}(\boldsymbol{\xi}, \boldsymbol{\alpha}, \boldsymbol{\beta}) = -\frac{1}{2} \sum_{j=1}^N \sum_{k=1}^N \alpha_j \alpha_k y_j y_k \langle \boldsymbol{x}_j, \boldsymbol{x}_k\rangle + \sum_{k=1}^N \alpha_k + \sum_{k=1}^N (C - \alpha_k - \beta_k) \xi_k. \]

Equating (3.16) to zero, we obtain

\begin{eqnarray} \mathcal{L}_{\tiny D}(\boldsymbol{\alpha}) = -\frac{1}{2} \sum_{j=1}^N \sum_{k=1}^N \alpha_j \alpha_k y_j y_k \langle \boldsymbol{x}_j, \boldsymbol{x}_k\rangle + \sum_{k=1}^N \alpha_k . \end{eqnarray}

(3.19)

Thus, we have proved

\[ \mathcal{L}_{\tiny D}(\boldsymbol{\alpha}) = \displaystyle{\min_{(b, \boldsymbol{w}), \boldsymbol{\xi}}} \mathcal{L}(b, \boldsymbol{w},\boldsymbol{\xi},\boldsymbol{\alpha},\boldsymbol{\beta}). \]

Further, since equating (3.16) to zero gives \(\alpha_k = C - \beta_k\) and \(\boldsymbol{\beta}\ge \boldsymbol{0}\), we see that \(\alpha_k\le C\) for each \(k=1,2,\ldots, N\).

The following constrained optimization problem is called the Lagrange dual problem associated to the primal problem (3.6):

\[ \left.\begin{array}{ll} & \displaystyle{\max_{\boldsymbol{\alpha}}}~\mathcal{L}_{\tiny D}(\boldsymbol{\alpha})\\ \text{subject to:}& \left\{\begin{array}{cl} \displaystyle{\sum_{j=1}^N} \alpha_j y_j = 0, \\ 0\le \alpha_k\le C,&k=1,2,\ldots, N. \end{array}\right. \end{array}\right\} \]

Using (3.19), the Lagrange dual problem can also be written as

\begin{eqnarray} \left.\begin{array}{ll} & \displaystyle{\min_{\boldsymbol{\alpha}}}~\left(\frac{1}{2} \sum_{j=1}^N \sum_{k=1}^N \alpha_j \alpha_k y_j y_k \langle \boldsymbol{x}_j, \boldsymbol{x}_k\rangle - \sum_{k=1}^N \alpha_k \right) \\ \text{subject to:}& \left\{\begin{array}{cl} \displaystyle{\sum_{j=1}^N} \alpha_j y_j = 0, \\ 0\le \alpha_k\le C,&k=1,2,\ldots, N. \end{array}\right. \end{array}\right\} \end{eqnarray}

(3.20)

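The above dual problem gives \(\boldsymbol{\alpha}^*\), which can be substituted in (3.17) to obtain the weight vector \(\boldsymbol{w}^*\). To obtain \(b^*\), note that a support vector \(\boldsymbol{x}_k\) with \(0<\alpha_k^*<C\) lies exactly on the margin, so that \(y_k(\boldsymbol{w}^*\cdot \boldsymbol{x}_k - b^*) = 1\) and hence, since \(y_k^2=1\), \(b^* = \boldsymbol{w}^*\cdot \boldsymbol{x}_k - y_k\). In practice, to reduce the effect of numerical errors, compute \(\boldsymbol{w}^*\cdot \boldsymbol{x}_k - y_k\) for all such support vectors and take the median value as \(b^*\).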
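Recovering the primal variables from a dual solution can be sketched as follows. With the sign convention \(h(\boldsymbol{x}) = \boldsymbol{w}\cdot\boldsymbol{x} - b\) used in this chapter, a support vector with \(0<\alpha_k<C\) satisfies \(b = \boldsymbol{w}\cdot\boldsymbol{x}_k - y_k\), and the sketch takes the median of these values. The dual vector `alpha` below is assumed to be optimal for the toy dataset (this can be checked by hand), and the function name is illustrative.

```python
from statistics import median

def dot(u, v):
    """Euclidean inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def primal_from_dual(alpha, X, Y, C, tol=1e-9):
    """Recover (w, b) from dual variables via (3.17) and the margin condition."""
    dim = len(X[0])
    w = [sum(a * y * x[i] for a, x, y in zip(alpha, X, Y)) for i in range(dim)]
    # Support vectors with 0 < alpha_k < C lie exactly on the margin,
    # where y_k (w . x_k - b) = 1, that is, b = w . x_k - y_k.
    on_margin = [k for k, a in enumerate(alpha) if tol < a < C - tol]
    b = median(dot(w, X[k]) - Y[k] for k in on_margin)
    return w, b

# Toy dataset and an assumed dual solution for it.
X = [(2.0, 0.0), (3.0, 1.0), (0.0, 0.0), (-1.0, 2.0)]
Y = [1, 1, -1, -1]
alpha = [0.5, 0.0, 0.5, 0.0]
w, b = primal_from_dual(alpha, X, Y, C=1.0)
print(w, b)
```

For this dataset the recovered hyperplane is \(x_1 = 1\), that is, \(\boldsymbol{w} = (1, 0)\) and \(b = 1\).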

Remark:

A point \((b^*, \mathbf{w}^*, \boldsymbol{\xi}^*, \boldsymbol{\alpha}^*, \boldsymbol{\beta}^*)\) is a saddle point of \(\mathcal{L}\) if

\begin{eqnarray} \mathcal{L}(b^*, \mathbf{w}^*, \boldsymbol{\xi}^*, \boldsymbol{\alpha}, \boldsymbol{\beta}) \le \mathcal{L}(b^*, \mathbf{w}^*, \boldsymbol{\xi}^*, \boldsymbol{\alpha}^*, \boldsymbol{\beta}^*) \le \mathcal{L}(b, \mathbf{w}, \boldsymbol{\xi}, \boldsymbol{\alpha}^*, \boldsymbol{\beta}^*) \end{eqnarray}

(3.21)

for all feasible \(b, \mathbf{w}, \boldsymbol{\xi} \ge 0\) and \(\boldsymbol{\alpha}, \boldsymbol{\beta} \ge 0\).

Observe that

the left inequality corresponds to maximization over the dual variables \((\boldsymbol{\alpha}, \boldsymbol{\beta})\).

the right inequality corresponds to minimization over the primal variables \((b, \mathbf{w}, \boldsymbol{\xi})\).

Thus, a saddle point is a point where the Lagrangian is simultaneously minimal with respect to the primal variables and maximal with respect to the dual variables. This is exactly the point that satisfies the Karush-Kuhn-Tucker (KKT) conditions for the SVM problem, which are given as follows:

Stationarity:

\begin{eqnarray} \frac{\partial \mathcal{L}}{\partial \mathbf{w}} &=& \mathbf{w} - \sum_{k=1}^N \alpha_k y_k \mathbf{x}_k = 0, \\ \frac{\partial \mathcal{L}}{\partial b} &=& \sum_{k=1}^N \alpha_k y_k = 0, \\ \frac{\partial \mathcal{L}}{\partial \xi_k} &=& C - \alpha_k - \beta_k = 0, \quad k=1,\dots,N, \nonumber \end{eqnarray}

(3.22)

Primal feasibility:

\begin{eqnarray} y_k (\mathbf{w} \cdot \mathbf{x}_k - b) \ge 1 - \xi_k, && k=1,\dots,N, \\ \xi_k \ge 0, && k=1,\dots,N, \end{eqnarray}

(3.23)

Dual feasibility:

\begin{eqnarray} \alpha_k \ge 0, \quad \beta_k \ge 0, \quad k=1,\dots,N, \nonumber \end{eqnarray}

(3.24)

Complementary slackness:

\begin{eqnarray} \alpha_k \big[ y_k (\mathbf{w} \cdot \mathbf{x}_k - b) - 1 + \xi_k \big] = 0, && k=1,\dots,N, \\ \beta_k \, \xi_k = 0, && k=1,\dots,N. \nonumber \end{eqnarray}

(3.25)

A point \((\mathbf{w}^*, b^*, \boldsymbol{\xi}^*, \boldsymbol{\alpha}^*, \boldsymbol{\beta}^*)\) satisfying these conditions is a saddle point of \(\mathcal{L}\).
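The four groups of conditions can be verified numerically for a candidate point. The sketch below is a plain Python check up to a tolerance; the function name `kkt_report`, the toy dataset, the value \(C=1\), and the candidate point are assumptions made for illustration.

```python
def dot(u, v):
    """Euclidean inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def kkt_report(w, b, xi, alpha, beta, X, Y, C, tol=1e-9):
    """Check conditions (3.22)-(3.25) up to a tolerance; returns a dict of booleans."""
    n, dim = len(X), len(X[0])
    # Values of the margin constraints y_k (w . x_k - b) - 1 + xi_k.
    g = [Y[k] * (dot(w, X[k]) - b) - 1.0 + xi[k] for k in range(n)]
    w_from_alpha = [sum(alpha[k] * Y[k] * X[k][i] for k in range(n)) for i in range(dim)]
    return {
        "stationarity": (
            all(abs(w[i] - w_from_alpha[i]) <= tol for i in range(dim))
            and abs(sum(alpha[k] * Y[k] for k in range(n))) <= tol
            and all(abs(C - alpha[k] - beta[k]) <= tol for k in range(n))
        ),
        "primal_feasibility": all(g[k] >= -tol and xi[k] >= -tol for k in range(n)),
        "dual_feasibility": all(alpha[k] >= -tol and beta[k] >= -tol for k in range(n)),
        "complementary_slackness": all(
            abs(alpha[k] * g[k]) <= tol and abs(beta[k] * xi[k]) <= tol for k in range(n)
        ),
    }

# Toy dataset and an assumed candidate saddle point with C = 1.
X = [(2.0, 0.0), (3.0, 1.0), (0.0, 0.0), (-1.0, 2.0)]
Y = [1, 1, -1, -1]
report = kkt_report(
    w=(1.0, 0.0), b=1.0, xi=[0.0] * 4,
    alpha=[0.5, 0.0, 0.5, 0.0], beta=[0.5, 1.0, 0.5, 1.0],
    X=X, Y=Y, C=1.0,
)
print(report)
```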

Problem:

Show that an input vector \(\boldsymbol{x}_i\) is a support vector of a soft-margin SVM if and only if its associated Lagrange multiplier \(\alpha_i > 0\) at the saddle point of the Lagrangian.

These problems have been extensively studied in the optimization literature, and various computational algorithms are available. We omit further discussion of this topic, as our primary focus is on deep learning techniques. SVM is introduced here mainly for comparison, to highlight an alternative classical approach to classification problems within the broader field of machine learning.

Nonlinearly Separable Datasets

So far, we have formulated the SVM problems to classify a given dataset. If the dataset is linearly separable, then the hard margin SVM can be used. On the other hand, the soft margin SVM allows some violations of the margin constraints by introducing slack variables. In this way, the soft margin SVM can handle a dataset that is not linearly separable in the given input space. However, this method still produces a hyperplane as the decision boundary, and hence remains a linear classifier in the input space, with a better tolerance to noise or overlap.

In the previous chapter, we discussed how mapping data into a higher-dimensional feature space can make a dataset linearly separable. In this section, we extend our discussion from the previous chapter on feature mapping and introduce the kernel method, which enables SVM to be a nonlinear classifier.

Kernel Method: Implicit Feature Mapping

At the end of the previous chapter, we saw that a suitable feature map can transform the dataset from the input space to a feature space in which the dataset is linearly separable. However, explicitly constructing such a map is often impractical, especially when the dataset is large or when the separation requires a highly nonlinear decision boundary.

Kernel methods overcome this difficulty by introducing a kernel function that computes inner products in a higher-dimensional feature space implicitly. This approach is known as the kernel trick. In this subsection, we outline the idea of the kernel trick without getting into technical details.

First, let us give the definition of a kernel function in the context of machine learning.

Definition:

[Kernel Function]

Let \(\mathcal{X}\) denote an input space. A function \(\text{𝕜}: \mathcal{X}\times \mathcal{X} \rightarrow \mathbb{R}\) is called a kernel function over \(\mathcal{X}\) if there exists a feature map \(\boldsymbol{\phi}: \mathcal{X} \rightarrow \mathbb{H}\), for some Hilbert space \(\mathbb{H},\) such that

\[ \text{𝕜}(\boldsymbol{x}_1, \boldsymbol{x}_2) = \langle \boldsymbol{\phi}(\boldsymbol{x}_1), \boldsymbol{\phi}(\boldsymbol{x}_2) \rangle, ~\text{for all}~\boldsymbol{x}_1, \boldsymbol{x}_2\in \mathcal{X}, \]

where \(\langle \cdot, \cdot \rangle\) denotes the inner product on \(\mathbb{H}\).

The following problems collect some important properties of kernel functions.

Problem:

For any finite set of points \(\{\boldsymbol{x}_1,\boldsymbol{x}_2,\dots,\boldsymbol{x}_N\}\subset \mathcal{X}\), show that the Gram matrix

\[ G = \Big(\text{𝕜}(\boldsymbol{x}_i, \boldsymbol{x}_j)\Big)_{i,j=1}^N \]

is symmetric and positive semidefinite.
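Before attempting a proof, the claim can be spot-checked numerically. The sketch below builds the Gram matrix of the Gaussian kernel (listed among the common kernels later in this chapter) on a few random points and evaluates the quadratic form \(\boldsymbol{z}^\top G \boldsymbol{z}\) for random vectors \(\boldsymbol{z}\). This is only a sanity check, not a proof; the sample size, the seed, and the function names are illustrative assumptions.

```python
import math
import random

def rbf(x1, x2, sigma=1.0):
    """Gaussian kernel exp(-||x1 - x2||^2 / (2 sigma^2))."""
    sq = sum((a - b) ** 2 for a, b in zip(x1, x2))
    return math.exp(-sq / (2.0 * sigma ** 2))

def gram_matrix(kernel, points):
    """Matrix of kernel values k(x_i, x_j), stored as a list of rows."""
    return [[kernel(p, q) for q in points] for p in points]

def quadratic_form(G, z):
    """z^T G z for a square matrix G given as a list of rows."""
    n = len(z)
    return sum(z[i] * G[i][j] * z[j] for i in range(n) for j in range(n))

rng = random.Random(0)
points = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(6)]
G = gram_matrix(rbf, points)

symmetric = all(abs(G[i][j] - G[j][i]) < 1e-12 for i in range(6) for j in range(6))
smallest = min(
    quadratic_form(G, [rng.gauss(0, 1) for _ in range(6)]) for _ in range(200)
)
print(symmetric, smallest >= -1e-9)
```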

Problem:

Let \(\text{𝕜}_1, \text{𝕜}_2:\mathcal{X}\times\mathcal{X}\to\mathbb{R}\) be kernel functions. That is, there exist feature maps \(\boldsymbol{\phi}_1: \mathcal{X}\rightarrow \mathbb{H}_1\) and \(\boldsymbol{\phi}_2: \mathcal{X}\rightarrow \mathbb{H}_2\) such that

\[ \text{𝕜}_j(\boldsymbol{x},\boldsymbol{y}) = \langle \boldsymbol{\phi}_j(\boldsymbol{x}),\boldsymbol{\phi}_j(\boldsymbol{y})\rangle_{\mathbb{H}_j},~~j=1,2. \]

Then show the following properties:

Linear combination: For any \(\alpha,\beta \ge 0\),
\[ \text{𝕜}(\boldsymbol{x},\boldsymbol{y}) = \alpha \text{𝕜}_1(\boldsymbol{x},\boldsymbol{y}) + \beta \text{𝕜}_2(\boldsymbol{x},\boldsymbol{y}) \]
is a kernel.
Product:
\[ \text{𝕜}(\boldsymbol{x},\boldsymbol{y}) = \text{𝕜}_1(\boldsymbol{x},\boldsymbol{y}) \cdot \text{𝕜}_2(\boldsymbol{x},\boldsymbol{y}) \]
is a kernel.
Polynomial transformation: If \(p(x)\) is a polynomial with non-negative coefficients, then \(p(\text{𝕜}_1(\boldsymbol{x},\boldsymbol{y}))\) is a kernel.
Exponential transformation: \(\exp(\text{𝕜}_1(\boldsymbol{x},\boldsymbol{y}))\) is also a kernel.
For any function \(f:\mathcal{X}\rightarrow \mathbb{R}\), the function \(\text{𝕜}(\boldsymbol{x},\boldsymbol{y}) = f(\boldsymbol{x})\text{𝕜}_1(\boldsymbol{x},\boldsymbol{y})f(\boldsymbol{y})\) is a kernel.

Observe that the dual problem (3.20) depends on the input vectors only through the inner products appearing in the objective function. Thus, one can implicitly work in a feature space by replacing the inner product of two input vectors with a kernel function based on a feature map \(\boldsymbol{\phi}\). This procedure is often referred to as the kernel trick. Let us make the idea of the kernel trick more precise.

Kernel Trick

Following (3.17), the representation of the weight vector in the feature space can be written as

\[ \boldsymbol{w} = \sum_{j=1}^N \alpha_j y_j \boldsymbol{\phi}(\boldsymbol{x}_j). \]

Thus, the decision boundary, which is a hyperplane in the feature space, corresponds to a (generally nonlinear) surface in the input space given by

\[ \sum_{j=1}^N \alpha_j y_j \langle \boldsymbol{\phi}(\boldsymbol{x}_j), \boldsymbol{\phi}(\boldsymbol{x}) \rangle - b = 0, \]

where the coefficients \(\alpha_j\), for \(j=1,2,\ldots, N\), are obtained by solving the dual problem (3.20), with the objective function now posed in the feature space as

\[ \displaystyle{\min_{\boldsymbol{\alpha}}}~\left(\frac{1}{2} \sum_{j=1}^N \sum_{k=1}^N \alpha_j \alpha_k y_j y_k \langle \boldsymbol{\phi}(\boldsymbol{x}_j), \boldsymbol{\phi}(\boldsymbol{x}_k)\rangle - \sum_{k=1}^N \alpha_k \right). \]

The kernel trick is to replace the inner product in the above objective function with a kernel function \(\text{𝕜}(\boldsymbol{x}_j, \boldsymbol{x}_k)\). Hence, the dual problem under consideration is

\begin{eqnarray} \left.\begin{array}{ll} & \displaystyle{\min_{\boldsymbol{\alpha}}}~\left(\frac{1}{2} \sum_{j=1}^N \sum_{k=1}^N \alpha_j \alpha_k y_j y_k \text{𝕜}(\boldsymbol{x}_j, \boldsymbol{x}_k) - \sum_{k=1}^N \alpha_k \right) \\ \text{subject to:}& \left\{\begin{array}{cl} \displaystyle{\sum_{j=1}^N} \alpha_j y_j = 0, \\ 0\le \alpha_k\le C,&k=1,2,\ldots, N. \end{array}\right. \end{array}\right\} \end{eqnarray}

(3.26)

The kernel trick enables us to use certain kernels without explicitly defining the corresponding feature map.
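Once the dual variables are known, a new point is classified using only kernel evaluations against the training points. The sketch below evaluates \(h(\boldsymbol{x}) = \sum_j \alpha_j y_j \text{𝕜}(\boldsymbol{x}_j, \boldsymbol{x}) - b\) for an arbitrary kernel passed as a function. For simplicity it is exercised with the linear kernel on a toy dataset whose dual solution is assumed to be known; the function names and the data are illustrative.

```python
def linear_kernel(x1, x2):
    """Plain inner product, corresponding to the identity feature map."""
    return sum(a * b for a, b in zip(x1, x2))

def decision_function(x, alpha, X, Y, b, kernel):
    """h(x) = sum_j alpha_j y_j k(x_j, x) - b."""
    return sum(a * y * kernel(xj, x) for a, xj, y in zip(alpha, X, Y)) - b

def predict(x, alpha, X, Y, b, kernel):
    """Predicted label in {-1, +1}: the sign of the decision function."""
    return 1 if decision_function(x, alpha, X, Y, b, kernel) >= 0 else -1

# Toy dataset with an assumed dual solution (alpha, b).
X = [(2.0, 0.0), (3.0, 1.0), (0.0, 0.0), (-1.0, 2.0)]
Y = [1, 1, -1, -1]
alpha, b = [0.5, 0.0, 0.5, 0.0], 1.0

for x in X:
    print(x, decision_function(x, alpha, X, Y, b, linear_kernel),
          predict(x, alpha, X, Y, b, linear_kernel))
```

Replacing `linear_kernel` by any other kernel function changes the classifier without changing this code, which is the practical content of the kernel trick.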

Example:

[Commonly Used Kernel Functions]

Some of the widely used kernel functions are listed below:

Linear kernel: This kernel is defined as
\begin{eqnarray} \text{𝕜}(\boldsymbol{x}_1, \boldsymbol{x}_2) = \langle \boldsymbol{x}_1, \boldsymbol{x}_2 \rangle, \end{eqnarray}
(3.27)
which corresponds to no feature mapping (the feature space is the same as input space).
Polynomial kernel: This kernel is given by
\begin{eqnarray} \text{𝕜}(\boldsymbol{x}_1, \boldsymbol{x}_2) = \big( \langle \boldsymbol{x}_1, \boldsymbol{x}_2 \rangle + c \big)^d, \quad c \geq 0,\; d \in \mathbb{N}, \end{eqnarray}
(3.28)
which maps into a higher-dimensional space involving monomials up to degree \(d\).
Radial Basis Function (RBF) or Gaussian kernel: This is a commonly used kernel given by
\begin{eqnarray} \text{𝕜}(\boldsymbol{x}_1, \boldsymbol{x}_2) = \exp\!\left( - \frac{\|\boldsymbol{x}_1 - \boldsymbol{x}_2\|^2}{2\sigma^2} \right), \quad \sigma > 0, \end{eqnarray}
(3.29)
which corresponds to an infinite-dimensional feature space.
Sigmoid kernel: This kernel is defined as
\begin{eqnarray} \text{𝕜}(\boldsymbol{x}_1, \boldsymbol{x}_2) = \tanh\!\big( \kappa \langle \boldsymbol{x}_1, \boldsymbol{x}_2 \rangle + c \big), \quad \kappa > 0,\; c \in \mathbb{R}. \end{eqnarray}
(3.30)
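This kernel originates from the activation function of a neural network with one hidden layer. Note that it is positive semidefinite only for certain choices of \(\kappa\) and \(c\), so it is not a kernel in the sense of the above definition for all parameter values.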
Exponential kernel: This kernel is given by
\begin{eqnarray} \text{𝕜}(\boldsymbol{x}_1, \boldsymbol{x}_2) = \exp\!\left( - \frac{\|\boldsymbol{x}_1 - \boldsymbol{x}_2\|}{\sigma} \right), \quad \sigma > 0. \end{eqnarray}
(3.31)
This is similar to the RBF kernel but uses the Euclidean distance itself instead of its square.
Rational quadratic kernel: This kernel is given by
\begin{eqnarray} \text{𝕜}(\boldsymbol{x}_1,\boldsymbol{x}_2) = 1 - \frac{\|\boldsymbol{x}_1 - \boldsymbol{x}_2\|^2}{\|\boldsymbol{x}_1 - \boldsymbol{x}_2\|^2 + c}, \quad c > 0. \end{eqnarray}
(3.32)
This kernel acts like a scale mixture of RBF kernels with different length scales.
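The kernels listed above translate directly into code. The following plain Python sketch implements (3.27)-(3.32) for vectors given as tuples; the default parameter values and the function names are arbitrary choices made for illustration.

```python
import math

def inner(x1, x2):
    """Euclidean inner product."""
    return sum(a * b for a, b in zip(x1, x2))

def sq_dist(x1, x2):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(x1, x2))

def linear(x1, x2):
    return inner(x1, x2)

def polynomial(x1, x2, c=1.0, d=2):
    return (inner(x1, x2) + c) ** d

def rbf(x1, x2, sigma=1.0):
    return math.exp(-sq_dist(x1, x2) / (2.0 * sigma ** 2))

def sigmoid(x1, x2, kappa=1.0, c=0.0):
    return math.tanh(kappa * inner(x1, x2) + c)

def exponential(x1, x2, sigma=1.0):
    return math.exp(-math.sqrt(sq_dist(x1, x2)) / sigma)

def rational_quadratic(x1, x2, c=1.0):
    d2 = sq_dist(x1, x2)
    return 1.0 - d2 / (d2 + c)

u, v = (1.0, 2.0), (3.0, 4.0)
for k in (linear, polynomial, rbf, sigmoid, exponential, rational_quadratic):
    print(k.__name__, k(u, v))
```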

Let us illustrate the kernel trick by using the quadratic polynomial kernel to train a hard margin SVM for the XOR function.

Example:

Let \(\boldsymbol{x} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}, \; \boldsymbol{y} = \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} \in \mathbb{R}^2\). The quadratic kernel is

\begin{eqnarray} \text{𝕜}(\boldsymbol{x}, \boldsymbol{y}) &=& (\langle \boldsymbol{x}, \boldsymbol{y}\rangle + c)^2, \quad c\ge 0. \end{eqnarray}

(3.33)

Define

\begin{eqnarray} \boldsymbol{\phi}(\boldsymbol{x}) = \begin{bmatrix} x_1^2 \\ x_2^2 \\ \sqrt{2} x_1 x_2 \\ \sqrt{2c} \, x_1 \\ \sqrt{2c} \, x_2 \\ c \end{bmatrix} \in \mathbb{R}^6. \end{eqnarray}

(3.34)

Then the kernel can be written as an inner product in this feature space:

\begin{eqnarray} \text{𝕜}(\boldsymbol{x}, \boldsymbol{y}) = \langle \boldsymbol{\phi}(\boldsymbol{x}), \boldsymbol{\phi}(\boldsymbol{y}) \rangle = x_1^2 y_1^2 + x_2^2 y_2^2 + 2 x_1 x_2 y_1 y_2 + 2 c x_1 y_1 + 2 c x_2 y_2 + c^2. \end{eqnarray}

(3.35)
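The identity (3.35) is easy to confirm numerically. The sketch below compares the kernel value with the inner product of the explicit features for one pair of vectors; the test vectors, the value \(c=1\), and the function names are arbitrary illustrative choices.

```python
import math

def quadratic_kernel(x, y, c):
    """Quadratic kernel (<x, y> + c)^2 on R^2."""
    return (x[0] * y[0] + x[1] * y[1] + c) ** 2

def phi(x, c):
    """Feature map (3.34) from R^2 to R^6; requires c >= 0."""
    r2, r2c = math.sqrt(2.0), math.sqrt(2.0 * c)
    return (x[0] ** 2, x[1] ** 2, r2 * x[0] * x[1], r2c * x[0], r2c * x[1], c)

x, y, c = (1.0, 2.0), (3.0, -1.0), 1.0
lhs = quadratic_kernel(x, y, c)
rhs = sum(a * b for a, b in zip(phi(x, c), phi(y, c)))
print(lhs, rhs)
```

The kernel evaluation needs one inner product in \(\mathbb{R}^2\), whereas the explicit route computes two vectors in \(\mathbb{R}^6\) first; both give the same value up to rounding.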

Consider the XOR function with 0 replaced by \(-1\). Using the quadratic feature map \(\phi\) given above, the dataset in the feature space is given by the following table:
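The rows of such a table can be generated with a short script. The sketch below assumes that the replacement of 0 by \(-1\) applies to both the inputs and the labels, so that the inputs are the four points of \(\{-1,1\}^2\) and the label is \(+1\) exactly when the two inputs differ. It also fixes \(c=1\) purely for illustration; the function names are hypothetical.

```python
import math

def phi(x, c):
    """Feature map (3.34) from R^2 to R^6; requires c >= 0."""
    r2, r2c = math.sqrt(2.0), math.sqrt(2.0 * c)
    return (x[0] ** 2, x[1] ** 2, r2 * x[0] * x[1], r2c * x[0], r2c * x[1], c)

def xor_label(x):
    """XOR with 0 replaced by -1: the label is +1 exactly when the inputs differ."""
    return 1 if x[0] != x[1] else -1

c = 1.0  # assumed value of the kernel constant
inputs = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
table = [(x, phi(x, c), xor_label(x)) for x in inputs]

# One row per example: input, six feature values, label.
for x, features, label in table:
    print(x, [round(f, 4) for f in features], label)
```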

Problem:

Show that the XOR dataset is linearly separable in this feature space. Find the hard margin SVM and identify all the support vectors.