Math ∩ Programming

Formulating the support vector machine optimization problem, the hypothesis and the setup.

This blog post has an interactive demo (mostly used toward the end of the post). The source for this demo is available in a Github repository.

Last time we saw how the inner product of two vectors gives rise to a decision rule: if $ w$ is the normal to a line (or hyperplane) $ L$, the sign of the inner product $ \langle x, w \rangle$ tells you whether $ x$ is on the same side of $ L$ as $ w$.

Let’s translate this to the parlance of machine-learning. Let $ x \in \mathbb{R}^n$ be a training data point, and $ y \in \{ 1, -1 \}$ its label (green and red, in the images in this post). Suppose you want to find a hyperplane which separates all the points with -1 labels from those with +1 labels (assume for the moment that this is possible). For this and all examples in this post, we’ll use data in two dimensions, but the math will apply to any dimension.

problem_setup

Some data labeled red and green, which is separable by a hyperplane (line).

The  hypothesis  we’re proposing to separate these points is a hyperplane, i.e. a linear subspace that splits all of $ \mathbb{R}^n$ into two halves. The data that represents this hyperplane is a single vector $ w$, the normal to the hyperplane, so that the hyperplane is defined by the solutions to the equation $ \langle x, w \rangle = 0$.

As we saw last time, $ w$ encodes the following rule for deciding if a new point $ z$ has a positive or negative label.

$ \displaystyle h_w(z) = \textup{sign}(\langle w, z \rangle)$

You’ll notice that this formula only works for the normals $ w$ of hyperplanes that pass through the origin, and generally we want to work with data that can be shifted elsewhere. We can resolve this by adding a fixed term $ b \in \mathbb{R}$—often called a bias because statisticians came up with it—so that the shifted hyperplane is the set of solutions to $ \langle x, w \rangle + b = 0$. The shifted decision rule is:

$ \displaystyle h_w(z) = \textup{sign}(\langle w, z \rangle + b)$

Now the hypothesis is the pair of vector-and-scalar $ w, b$.
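To make the hypothesis concrete, here is a minimal numpy sketch of this decision rule; the hyperplane and the query points are made-up values for illustration, not data from the post’s demo.

```python
import numpy as np

def h(w, b, z):
    """Decision rule h_{w,b}(z) = sign(<w, z> + b), with ties broken toward +1."""
    return 1 if np.dot(w, z) + b >= 0 else -1

# hypothetical hyperplane: normal w and bias b
w = np.array([2.0, 1.0])
b = -1.0

print(h(w, b, np.array([1.0, 1.0])))    # +1, on the same side as w
print(h(w, b, np.array([-1.0, 0.0])))   # -1, on the other side
```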

The key intuitive idea behind the formulation of the SVM problem is that there are  many possible separating hyperplanes for a given set of labeled training data. For example, here is a gif showing infinitely many choices.

The question is: how can we find the separating hyperplane that not only separates the training data, but generalizes as well as possible to new data? The assumption of the SVM is that a hyperplane which separates the points, but is also  as far away from any training point as possible , will generalize best.

optimal_example.png

Though this example is contrived, it’s easy to see that the separating hyperplane is as far as possible from any training point.

More specifically, fix a labeled dataset of points $ (x_i, y_i)$, or more precisely:

$ \displaystyle D = \{ (x_i, y_i) \mid i = 1, \dots, m, x_i \in \mathbb{R}^{n}, y_i \in \{1, -1\}  \}$

And a hypothesis defined by the normal $ w \in \mathbb{R}^{n}$ and a shift $ b \in \mathbb{R}$. Let’s also suppose that $ (w,b)$ defines a hyperplane that correctly separates all the training data into the two labeled classes, and we just want to measure its quality. That measure of quality is the length of its  margin.

Definition: The geometric margin of a hyperplane $ w$ with respect to a dataset $ D$ is the shortest distance from a training point $ x_i$ to the hyperplane defined by $ w$.

The best hyperplane has the largest possible margin.

This margin can even be computed quite easily using our work from the last post. The distance from $ x$ to the hyperplane defined by $ w$ is the same as the length of the projection of $ x$ onto $ w$. And this is just computed by an inner product.

decision-rule-3

If the tip of the $ x$ arrow is the point in question, then $ a$ is the dot product, and $ b$ the distance from $ x$ to the hyperplane $ L$ defined by $ w$.
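In code, the geometric margin is just the smallest of these projection lengths. A minimal numpy sketch, with an invented dataset and hyperplane (not the post’s figures):

```python
import numpy as np

def geometric_margin(X, w, b):
    """Shortest distance from any row of X to the hyperplane <x, w> + b = 0."""
    distances = np.abs(X @ w + b) / np.linalg.norm(w)
    return distances.min()

X = np.array([[2.0, 3.0], [0.0, -1.0], [4.0, 1.0]])   # made-up training points
w = np.array([1.0, -1.0])
b = 0.5
print(geometric_margin(X, w, b))    # ~0.354 for these numbers
```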

A naive optimization objective

If we wanted to, we could stop now and define an optimization problem that would be very hard to solve. It would look like this:

$ \displaystyle \begin{aligned} & \max_{w, b} \min_{x_i} \frac{\left | \langle x_i, w \rangle + b \right |}{\left \| w \right \|} & \\ \textup{subject to \ \ } & \textup{sign}(\langle x_i, w \rangle + b) = \textup{sign}(y_i) & \textup{ for every } i = 1, \dots, m \end{aligned}$

This formulation is hard to work with because it’s horrifyingly nonlinear. In more detail:

  • The constraints are nonlinear due to the sign comparisons.
  • There’s a min  and a max! A priori, we have to do this because we don’t know which point is going to be the closest to the hyperplane.
  • The objective is nonlinear in two ways: the absolute value and the projection requires you to take a norm and divide.

The rest of this post (and indeed, a lot of the work in grokking SVMs) is dedicated to converting this optimization problem to one in which the constraints are all linear inequalities and the objective is a single, quadratic polynomial we want to minimize or maximize.

Along the way, we’ll notice some neat features of the SVM.

Trick 1: linearizing the constraints

To solve the first problem, we can use a trick. We want to know whether $ \textup{sign}(\langle x_i, w \rangle + b) = \textup{sign}(y_i)$ for a labeled training point $ (x_i, y_i)$. The trick is to multiply them together. If their signs agree, then their product will be positive, otherwise it will be negative.

So each constraint becomes:

$ \displaystyle (\langle x_i, w \rangle + b) \cdot y_i \geq 0$

This constraint is linear because $ y_i$ is a constant (an input) to the optimization problem; the variables are the coefficients of $ w$ and the bias $ b$.

The left hand side of this inequality is often called the  functional margin of a training point, since, as we will see, it still works to classify $ x_i$, even if $ w$ is scaled so that it is no longer a unit vector. Indeed, the sign of the inner product is independent of how $ w$ is scaled.
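A quick numpy check of that last claim (with arbitrary numbers): rescaling $ w$ and $ b$ rescales the functional margin, but never flips its sign.

```python
import numpy as np

x, y = np.array([1.0, 2.0]), 1            # a hypothetical labeled point
w, b = np.array([0.5, -0.2]), 0.1         # a hypothetical hyperplane

for scale in [0.1, 1.0, 10.0]:
    functional_margin = y * (np.dot(scale * w, x) + scale * b)
    print(scale, functional_margin, np.sign(functional_margin))
# the magnitude changes with the scale, but the sign (the classification) does not
```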

Trick 1.5: the optimal solution is midway between classes

This small trick is to notice that if $ w$ is the supposed optimal separating hyperplane, i.e. its margin is maximized, then it must necessarily be exactly halfway in between the closest points in the positive and negative classes.

In other words, if $ x_+$ and $ x_-$ are the closest points in the positive and negative classes, respectively, then $ \langle x_{+}, w \rangle + b = -(\langle x_{-}, w \rangle + b)$. If this were not the case, you could adjust the bias, shifting the decision boundary along $ w$ until the two quantities are exactly equal, and you would have increased the margin. The closest point, say $ x_+$, would have gotten farther away, and the closest point in the opposite class, $ x_-$, would have gotten closer, but would not end up closer to the hyperplane than $ x_+$ was.

Trick 2: getting rid of the max + min

Resolving this problem essentially uses the fact that the hypothesis, which comes in the form of the normal vector $ w$, has a degree of freedom in its length. To explain the details of this trick, we’ll set $ b=0$ which simplifies the intuition.

Indeed, in the animation below, I can increase or decrease the length of $ w$ without changing the decision boundary.

I have to keep my hand very steady (because I was too lazy to program it so that it only increases/decreases in length), but you can see the point. The line is perpendicular to the normal vector, and it doesn’t depend on the length.

Let’s combine this with tricks 1 and 1.5. If we increase the length of $ w$, that means the absolute values of the dot products $ \langle x_i, w \rangle $ used in the constraints will all increase by the same amount (without changing their sign). Indeed, for any vector $ a$ we have $ \langle a, w \rangle = \|w \| \cdot \langle a, w / \| w \| \rangle$.

In this world, the inner product measurement of distance from a point to the hyperplane is no longer faithful. The true distance is $ \langle a, w / \| w \| \rangle$, but the distance measured by $ \langle a, w \rangle$ is measured in units of $ 1 / \| w \|$.

units.png

In this example, the two numbers next to the green dot represent the true distance of the point from the hyperplane, and the dot product of the point with the normal (respectively). The dashed lines are the solutions to <x, w> = 1. The magnitude of w is 2.2, the inverse of that is 0.46, and indeed 2.2 = 4.8 * 0.46 (we’ve rounded the numbers).

Now suppose we had the optimal hyperplane and its normal $ w$. No matter how near (or far) the nearest positively labeled training point $ x$ is, we could scale the length of $ w$ to force $ \langle x, w \rangle = 1$. This is the core of the trick. One consequence is that the  actual distance from $ x$ to the hyperplane is $ \frac{1}{\| w \|} = \langle x, w / \| w \| \rangle$.

units2.png

The same as above, but with the roles reversed. We’re forcing the inner product of the point with w to be 1. The true distance is unchanged.

In particular, if we force the closest point to have inner product 1, then all other points will have inner product at least 1. This has two consequences. First, our constraints change to $ \langle x_i, w \rangle \cdot y_i \geq 1$ instead of $ \geq 0$. Second, we no longer need to ask which point is closest to the candidate hyperplane! Because after all, we never cared  which point it was, just  how far away that closest point was. And now we know that it’s exactly $ 1 / \| w \|$ away. Indeed, if the optimal points weren’t at that distance, then that means the closest point doesn’t exactly meet the constraint, i.e. that $ \langle x, w \rangle > 1$ for every training point $ x$. We could then scale $ w$ shorter until $ \langle x, w \rangle = 1$, hence increasing the margin $ 1 / \| w \|$.

In other words (the coup de grâce): provided all the constraints are satisfied, the optimization objective is just to maximize $ 1 / \| w \|$, a.k.a. to minimize $ \| w \|$.

This intuition is clear from the following demonstration, which you can try for yourself . In it I have a bunch of positively and negatively labeled points, and the line in the center is the candidate hyperplane with normal $ w$ that you can drag around. Each training point has two numbers next to it. The first is the true distance from that point to the candidate hyperplane; the second is the inner product with $ w$. The two blue dashed lines are the solutions to $ \langle x, w \rangle = \pm 1$. To solve the SVM by hand, you have to ensure the second number is at least 1 for all green points, at most -1 for all red points, and then you have to make $ w$ as short as possible. As we’ve discussed, shrinking $ w$ moves the blue lines farther away from the separator, but in order to satisfy the constraints the blue lines can’t go further than any training point. Indeed, the optimum will have those blue lines touching a training point on each side.

I bet you enjoyed watching me struggle to solve it. And while it’s probably not  the optimal solution, the idea should be clear.

The final note is that, since we are now minimizing $ \| w \|$, a formula which includes a square root, we may as well minimize its square $ \| w \|^2 = \sum_j w_j^2$. We will also multiply the objective by $ 1/2$, because when we eventually analyze this problem we will take a derivative, and the square in the exponent and the $ 1/2$ will cancel.

The final form of the problem

Our optimization problem is now the following (including the bias again):

$ \displaystyle \begin{aligned} & \min_{w}  \frac{1}{2} \| w \|^2 & \\ \textup{subject to \ \ } & (\langle x_i, w \rangle + b) \cdot y_i \geq 1 & \textup{ for every } i = 1, \dots, m \end{aligned}$

This is much simpler to analyze. The constraints are all linear inequalities (which, because of linear programming, we know are tractable to optimize). The objective to minimize, however, is a convex quadratic function of the input variables—a sum of squares of the inputs.

Such problems are generally called  quadratic programming problems (or QPs, for short). There are general methods to find solutions! However, they often suffer from numerical stability issues and have less-than-satisfactory runtime. Luckily, the form in which we’ve expressed the support vector machine problem is specific enough that we can analyze it directly, and find a way to solve it without appealing to general-purpose numerical solvers.
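Just to see the final form in action before then, here is a hedged sketch that hands a toy instance of the problem to a general-purpose convex solver (cvxpy). This is only a sanity check of the formulation on an invented, linearly separable dataset, not the specialized approach the post is building toward.

```python
import numpy as np
import cvxpy as cp

# a tiny, linearly separable toy dataset (invented for illustration)
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
              [-1.0, -1.0], [-2.0, -1.5], [-1.5, -2.5]])
y = np.array([1, 1, 1, -1, -1, -1])

w = cp.Variable(X.shape[1])
b = cp.Variable()

# minimize (1/2)||w||^2 subject to y_i * (<x_i, w> + b) >= 1
constraints = [cp.multiply(y, X @ w + b) >= 1]
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints)
problem.solve()

print("w =", w.value, " b =", b.value, " margin =", 1 / np.linalg.norm(w.value))
```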

We will tackle this problem in a future post (planned for two posts after this one). Before we close, let’s just make a few more observations about the solution to the optimization problem.

Support Vectors

In Trick 1.5 we saw that the optimal separating hyperplane has to be exactly halfway between the two closest points of opposite classes. Moreover, we noticed that, provided we’ve scaled $ \| w \|$ properly, these closest points (there may be multiple for positive and negative labels) have to be exactly “distance” 1 away from the separating hyperplane.

Another way to phrase this without putting “distance” in scare quotes is to say that, if $ w$ is the normal vector of the optimal separating hyperplane, the closest points lie on the two lines $ \langle x_i, w \rangle + b = \pm 1$.

Now that we have some intuition for the formulation of this problem, it isn’t a stretch to realize the following. While a dataset may include many points from either class on the two lines $ \langle x_i, w \rangle + b = \pm 1$, the optimal hyperplane itself does not depend on any points other than these closest points.

This fact is enough to give these closest points a special name: the support vectors.

We’ll actually prove that support vectors “are all you need” with full rigor and detail next time, when we cast the optimization problem in this post into the “dual” setting. To avoid vague names, the formulation described in this post is called the “primal” problem. The dual problem is derived from the primal problem, with special variables and constraints chosen based on the primal variables and constraints. Next time we’ll describe in brief detail what the dual does and why it’s important, but we won’t have nearly enough time to give a full understanding of duality in optimization (such a treatment would fill a book).

When we compute the dual of the SVM problem, we will see explicitly that the hyperplane can be written as a linear combination of the support vectors. As such, once you’ve found the optimal hyperplane, you can compress the training set into  just the support vectors, and reproducing the same optimal solution becomes much, much faster. You can also use the support vectors to augment the SVM to incorporate streaming data (throw out all non-support vectors after every retraining).
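As a rough sketch of that compression idea (assuming a $(w, b)$ pair produced by some solver, for instance the cvxpy snippet earlier): the support vectors are simply the points whose constraints are tight, and retraining on them alone should reproduce the same hyperplane.

```python
import numpy as np

def support_vector_indices(X, y, w, b, tol=1e-6):
    """Indices of training points whose constraint y_i * (<x_i, w> + b) >= 1 is tight."""
    functional_margins = y * (X @ w + b)
    return np.where(functional_margins <= 1 + tol)[0]

# X[idx], y[idx] is the "compressed" training set; retraining on it alone
# should give back (approximately) the same optimal hyperplane.
```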

Eventually, when we get to implementing the SVM from scratch, we’ll see all this in action.

Until then!



Lecture 9: SVM

The Support Vector Machine (SVM) is a linear classifier that can be viewed as an extension of the Perceptron developed by Rosenblatt in 1958. The Perceptron guarantees that you find a hyperplane if one exists. The SVM finds the maximum margin separating hyperplane.

Figure 1: (Left:) Two different separating hyperplanes for the same data set. (Right:) The maximum margin hyperplane. The margin, $\gamma$, is the distance from the hyperplane (solid line) to the closest points in either class (which touch the parallel dotted lines).

We already saw the definition of a margin in the context of the Perceptron . A hyperplane is defined through $\mathbf{w},b$ as a set of points such that $\mathcal{H}=\left\{\mathbf{x}\vert{}\mathbf{w}^T\mathbf{x}+b=0\right\}$. Let the margin $\gamma$ be defined as the distance from the hyperplane to the closest point across both classes.


By definition, the margin and hyperplane are scale invariant: $\gamma(\beta\mathbf{w},\beta b)=\gamma(\mathbf{w},b), \forall \beta \neq 0$

Note that if the hyperplane is such that $\gamma$ is maximized, it must lie right in the middle of the two classes. In other words, $\gamma$ must be the distance to the closest point within both classes. (If not, you could move the hyperplane towards data points of the class that is further away and increase $\gamma$, which contradicts that $\gamma$ is maximized.)

Max Margin Classifier

The maximum margin classifier solves $\max_{\mathbf{w},b}\gamma(\mathbf{w},b)$ subject to $\forall i \ y_{i}(\mathbf{w}^T\mathbf{x}_{i}+b)\geq 0$, i.e. it maximizes the margin while classifying every training point correctly. These constraints are still hard to deal with, however luckily we can show that (for the optimal solution) they are equivalent to a much simpler formulation. (Make sure you know how to prove that the two sets of constraints are equivalent.) $$ \begin{align} &\min_{\mathbf{w},b}\mathbf{w}^T\mathbf{w}&\\ &\textrm{s.t.} \ \ \ \forall i \ y_{i}(\mathbf{w}^T \mathbf{x}_{i}+b) \geq 1 & \end{align} $$

This new formulation is a quadratic optimization problem. The objective is quadratic and the constraints are all linear. We can solve it efficiently with any QCQP (Quadratically Constrained Quadratic Program) solver. It has a unique solution whenever a separating hyperplane exists. It also has a nice interpretation: Find the simplest hyperplane (where simpler means smaller $\mathbf{w}^\top\mathbf{w}$) such that all inputs lie at least 1 unit away from the hyperplane on the correct side.

Support Vectors

For the optimal $\mathbf{w},b$ pair, some training points will have tight constraints, i.e. $$y_{i}(\mathbf{w}^T \mathbf{x}_{i}+b) = 1.$$ (This must be the case, because if for all training points we had a strict $>$ inequality, it would be possible to scale down both parameters $\mathbf{w},b$ until the constraints are tight and obtain an even lower objective value.) We refer to these training points as support vectors. Support vectors are special because they are the training points that define the maximum margin of the hyperplane to the data set and they therefore determine the shape of the hyperplane. If you were to move one of them and retrain the SVM, the resulting hyperplane would change. The opposite is the case for non-support vectors (provided you don't move them too much, or they would turn into support vectors themselves). This will become particularly important in the dual formulation for Kernel-SVMs.

SVM with soft constraints

If the data is low dimensional it is often the case that there is no separating hyperplane between the two classes. In this case, there is no solution to the optimization problems stated above. We can fix this by allowing the constraints to be violated ever so slightly with the introduction of slack variables: $$\begin{matrix} \min_{\mathbf{w},b}\mathbf{w}^T\mathbf{w}+C\sum_{i=1}^{n}\xi _{i} \\ s.t. \ \forall i \ y_{i}(\mathbf{w}^T\mathbf{x}_{i}+b)\geq 1-\xi_i \\ \forall i \ \xi_i \geq 0 \end{matrix}$$ The slack variable $\xi_i$ allows the input $\mathbf{x}_i$ to be closer to the hyperplane (or even be on the wrong side), but there is a penalty in the objective function for such "slack". If C is very large, the SVM becomes very strict and tries to get all points to be on the right side of the hyperplane. If C is very small, the SVM becomes very loose and may "sacrifice" some points to obtain a simpler (i.e. lower $\|\mathbf{w}\|_2^2$) solution.

Unconstrained Formulation:

Let us consider the value of $\xi_i$ for the case of $C\neq 0$. Because the objective will always try to minimize $\xi_i$ as much as possible, the constraint must hold with equality and we have: $$ \xi_i=\left\{ \begin{array}{cl} 1-y_{i}(\mathbf{w}^T \mathbf{x}_{i}+b) & \textrm{ if } y_{i}(\mathbf{w}^T \mathbf{x}_{i}+b) < 1 \\ 0 & \textrm{ if } y_{i}(\mathbf{w}^T \mathbf{x}_{i}+b) \geq 1 \end{array} \right. $$ Plugging this closed form into the objective gives the unconstrained version as loss function and regularizer: $$ \min_{\mathbf{w},b}\underbrace{\mathbf{w}^T\mathbf{w}}_{l_{2}\textrm{-regularizer}}+ C\ \sum_{i=1}^{n}\underbrace{\max\left [ 1-y_{i}(\mathbf{w}^T \mathbf{x}_{i}+b),0 \right ]}_{\textrm{hinge-loss}} $$ This formulation allows us to optimize the SVM parameters ($\mathbf{w},b$) just like logistic regression (e.g. through gradient descent). The only difference is that we have the hinge-loss instead of the logistic loss.
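As a rough sketch of that last point (this is not the lecture's code), here is plain subgradient descent on the $l_2$-regularized hinge loss; the learning rate, regularization strength, and toy data below are all made up.

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=500):
    """Subgradient descent on lam*||w||^2 + (1/n) * sum_i max(1 - y_i*(w.x_i + b), 0)."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        mask = margins < 1                                   # points with nonzero hinge loss
        grad_w = 2 * lam * w - (y[mask, None] * X[mask]).sum(axis=0) / n
        grad_b = -y[mask].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# toy usage with invented, linearly separable data
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -2.0], [-1.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = train_linear_svm(X, y)
print(np.sign(X @ w + b))   # should match y
```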

Figure 2: The five plots above show the resulting decision boundaries and the optimal separating hyperplane on example data for C = 0.01, 0.1, 1, 10, and 100.

Support Vector Machine (SVM) is a powerful machine learning algorithm used for linear or nonlinear classification, regression, and even outlier detection tasks. SVMs can be used for a variety of tasks, such as text classification, image classification, spam detection, handwriting identification, gene expression analysis, face detection, and anomaly detection. SVMs are adaptable and efficient in a variety of applications because they can manage high-dimensional data and nonlinear relationships.

SVM algorithms are effective because they find the maximum margin separating hyperplane between the different classes in the target feature.

Support Vector Machine

Support Vector Machine (SVM) is a supervised machine learning algorithm used for both classification and regression, though it is best suited for classification. The main objective of the SVM algorithm is to find the optimal hyperplane in an N-dimensional space that separates the data points of different classes in the feature space. The hyperplane is chosen so that the margin between the closest points of different classes is as large as possible. The dimension of the hyperplane depends upon the number of features. If the number of input features is two, then the hyperplane is just a line. If the number of input features is three, then the hyperplane becomes a 2-D plane. It becomes difficult to imagine when the number of features exceeds three.

Let’s consider two independent variables x_1, x_2 and one dependent variable which is either a blue circle or a red circle.


Linearly Separable Data points  

From the figure above it’s very clear that there are multiple lines (our hyperplane here is a line because we are considering only two input features x_1, x_2) that segregate our data points or do a classification between red and blue circles. So how do we choose the best line or in general the best hyperplane that segregates our data points?

How does SVM work?

One reasonable choice as the best hyperplane is the one that represents the largest separation or margin between the two classes. 

Multiple hyperplanes separating the data from two classes

So we choose the hyperplane whose distance from the nearest data point on each side is maximized. If such a hyperplane exists it is known as the maximum-margin hyperplane or hard margin. So from the above figure, we choose L2. Let’s consider a scenario like the one shown below.

Selecting hyperplane for data with outlier

Here we have one blue ball inside the boundary of the red balls. So how does SVM classify the data? The blue ball among the red ones is an outlier of the blue class. The SVM algorithm has the characteristic of ignoring such outliers and finding the best hyperplane that maximizes the margin; SVM is robust to outliers.

Hyperplane which is the most optimized one

For data like this, SVM finds the maximum margin as it did with the previous data sets, but in addition it adds a penalty each time a point crosses the margin. The margins in these cases are called soft margins. With a soft margin, the SVM tries to minimize an objective of the form (1/margin) + λ(∑ penalties). Hinge loss is a commonly used penalty: if there are no violations there is no hinge loss, and if there are violations the hinge loss is proportional to the distance of the violation.

Till now, we were talking about linearly separable data (the group of blue balls and red balls are separable by a straight line). What do we do if the data are not linearly separable?

Original 1D dataset for classification

Say our data is as shown in the figure above. SVM solves this by creating a new variable using a kernel. We call a point on the line x_i and we create a new variable y_i as a function of its distance from the origin o. If we plot this, we get something like the figure shown below.

Mapping 1D data to 2D to become able to separate the two classes

In this case, the new variable y is created as a function of distance from the origin. A non-linear function that creates a new variable is referred to as a kernel.

Support Vector Machine Terminology

  • Hyperplane: Hyperplane is the decision boundary that is used to separate the data points of different classes in a feature space. In the case of linear classifications, it will be a linear equation i.e. wx+b = 0.
  • Support Vectors: Support vectors are the closest data points to the hyperplane, which play a critical role in deciding the hyperplane and margin.
  • Margin: Margin is the distance between the support vectors and the hyperplane. The main objective of the support vector machine algorithm is to maximize the margin. A wider margin indicates better classification performance.
  • Kernel: A kernel is the mathematical function used in SVM to map the original input data points into high-dimensional feature spaces, so that the hyperplane can be easily found even if the data points are not linearly separable in the original input space. Some of the common kernel functions are linear, polynomial, radial basis function (RBF), and sigmoid.
  • Hard Margin: The maximum-margin hyperplane or the hard margin hyperplane is a hyperplane that properly separates the data points of different categories without any misclassifications.
  • Soft Margin: When the data is not perfectly separable or contains outliers, SVM permits a soft margin technique. Each data point has a slack variable introduced by the soft-margin SVM formulation, which softens the strict margin requirement and permits certain misclassifications or violations. It discovers a compromise between increasing the margin and reducing violations.
  • C: The regularisation parameter C in SVM balances margin maximisation against misclassification penalties. It decides the penalty for violating the margin or misclassifying data items. A greater value of C imposes a stricter penalty, which results in a smaller margin and perhaps fewer misclassifications.
  • Hinge Loss: A typical loss function in SVMs is hinge loss. It punishes incorrect classifications or margin violations. The objective function in SVM is frequently formed by combining it with the regularisation term.
  • Dual Problem: A dual Problem of the optimisation problem that requires locating the Lagrange multipliers related to the support vectors can be used to solve SVM. The dual formulation enables the use of kernel tricks and more effective computing.

Mathematical intuition of Support Vector Machine

Consider a binary classification problem with two classes, labeled as +1 and -1. We have a training dataset consisting of input feature vectors X and their corresponding class labels Y.

The equation for the linear hyperplane can be written as:

w^Tx+ b = 0

The vector w represents the normal vector to the hyperplane, i.e. the direction perpendicular to the hyperplane. The parameter b in the equation represents the offset of the hyperplane from the origin along the normal vector w.

The distance between a data point x_i and the decision boundary can be calculated as:

d_i = \frac{w^T x_i + b}{||w||}

where ||w|| represents the Euclidean norm of the normal vector w.

For Linear SVM classifier :

\hat{y} = \left\{ \begin{array}{cl} 1 & : \ w^Tx+b \geq 0 \\ -1 & : \  w^Tx+b  < 0 \end{array} \right.

Optimization:

  • For Hard margin linear SVM classifier:

\underset{w,b}{\text{minimize}}\frac{1}{2}w^Tw =\underset{W,b}{\text{minimize}}\frac{1}{2}\left\| w \right\|^{2} \\ \text{subject to}\; y_i(w^Tx_i + b) \geq 1 \;for\; i = 1, 2,3, \cdots,m

  • For Soft margin linear SVM classifier:

\underset{w,b}{\text{minimize }}\frac{1}{2}w^Tw+ C \sum_{i=1}^m \zeta_{i} \\ \text{subject to } y_i(w^Tx_i + b)\ge  1-\zeta_{i}\;\; and \; \zeta_{i} \ge 0\;\; for \; i = 1, 2,3, \cdots,m

  • Dual Problem: The SVM can also be solved through the dual of the optimisation problem, which requires locating the Lagrange multipliers associated with the support vectors. The optimal Lagrange multipliers α_i maximize the following dual objective function:

\underset{\alpha}{\text{maximize}}: \underset{i\to m}{\sum}\alpha_i - \frac{1}{2}\underset{i\to m\;}{\sum}\;\underset{j\to m}{\sum} \alpha_i\alpha_j t_i t_j K(x_i, x_j)

  • α_i is the Lagrange multiplier associated with the ith training sample.
  • K(x_i, x_j) is the kernel function that computes the similarity between two samples x_i and x_j. It allows SVM to handle nonlinear classification problems by implicitly mapping the samples into a higher-dimensional feature space.
  • The term ∑α_i represents the sum of all Lagrange multipliers.

Once the dual problem has been solved and the optimal Lagrange multipliers have been found, the SVM decision boundary can be described in terms of these optimal Lagrange multipliers and the support vectors. The training samples with α_i > 0 are the support vectors, and the decision function is given by:

f(x)= \underset{i\to m}{\sum}\alpha_i t_i K(x_i, x) + b, \qquad \text{and for any support vector } x_i: \ \ t_i(w^Tx_i+b) = 1 \Longleftrightarrow b= t_i - w^Tx_i

Types of Support Vector Machine

Based on the nature of the decision boundary, Support Vector Machines (SVM) can be divided into two main parts:

  • Linear SVM: Linear SVMs use a linear decision boundary to separate the data points of different classes. When the data can be precisely linearly separated, linear SVMs are very suitable. This means that a single straight line (in 2D) or a hyperplane (in higher dimensions) can entirely divide the data points into their respective classes. A hyperplane that maximizes the margin between the classes is the decision boundary.
  • Non-Linear SVM: Non-Linear SVM can be used to classify data when it cannot be separated into two classes by a straight line (in the case of 2D). By using kernel functions, nonlinear SVMs can handle nonlinearly separable data. The original input data is transformed by these kernel functions into a higher-dimensional feature space, where the data points can be linearly separated. A linear SVM is then used in this transformed space, which corresponds to a nonlinear decision boundary in the original space.

Popular kernel functions in SVM

The SVM kernel is a function that takes a low-dimensional input space and transforms it into a higher-dimensional space, i.e. it converts non-separable problems into separable problems. It is mostly useful in non-linear separation problems. Simply put, the kernel does some extremely complex data transformations and then finds the process to separate the data based on the labels or outputs defined.

\begin{aligned} \text{Linear : } K(x_i,x_j) &= x_i^Tx_j \\ \text{Polynomial : } K(x_i,x_j) &= (\gamma x_i^Tx_j+r)^d \\ \text{Gaussian RBF: } K(x_i,x_j) &= \exp(-\gamma|| x_i-x_j||^2) \\ \text{Sigmoid : } K(x_i, x_j) &= \tanh(\alpha x_i^Tx_j + b) \end{aligned}
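As a small illustration (not from the article), these kernels can be written as plain functions of two samples in numpy; the hyperparameter values below are arbitrary choices.

```python
import numpy as np

def linear_kernel(x1, x2):
    return np.dot(x1, x2)

def polynomial_kernel(x1, x2, gamma=1.0, r=1.0, degree=3):
    return (gamma * np.dot(x1, x2) + r) ** degree

def rbf_kernel(x1, x2, gamma=0.5):
    return np.exp(-gamma * np.linalg.norm(x1 - x2) ** 2)

def sigmoid_kernel(x1, x2, alpha=0.01, b=0.0):
    return np.tanh(alpha * np.dot(x1, x2) + b)

x_i, x_j = np.array([1.0, 2.0]), np.array([2.0, 0.5])
print(linear_kernel(x_i, x_j), polynomial_kernel(x_i, x_j), rbf_kernel(x_i, x_j))
```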

Advantages of SVM

  • Effective in high-dimensional cases.
  • It is memory efficient, as it uses a subset of training points in the decision function called support vectors.
  • Different kernel functions can be specified for the decision function, and it’s possible to specify custom kernels.

SVM implementation in Python

Predict whether a cancer is benign or malignant. Historical data about patients diagnosed with cancer, where each case is described by independent attributes, enables doctors to differentiate malignant cases from benign ones. The steps are listed below, followed by a sketch of the code.

  • Load the breast cancer dataset from sklearn.datasets
  • Separate input features and target variables.
  • Build and train the SVM classifier using an RBF kernel.
  • Plot the scatter plot of the input features.
  • Plot the decision boundary.
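A minimal sketch of those steps with scikit-learn follows. The exact code behind the figure below is not reproduced in this page, so the choice of the two plotted features and the hyperparameter values here are assumptions.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.svm import SVC

# 1-2. Load the breast cancer dataset and separate features from the target.
data = load_breast_cancer()
X = data.data[:, :2]          # keep only two features so the boundary can be plotted
y = data.target

# 3. Build and train an SVM classifier with an RBF kernel.
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)

# 4. Scatter plot of the two input features, colored by class.
plt.scatter(X[:, 0], X[:, 1], c=y, cmap="coolwarm", s=15)

# 5. Plot the decision boundary on a grid covering the data range.
xx, yy = np.meshgrid(np.linspace(X[:, 0].min(), X[:, 0].max(), 200),
                     np.linspace(X[:, 1].min(), X[:, 1].max(), 200))
zz = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
plt.contourf(xx, yy, zz, alpha=0.2, cmap="coolwarm")
plt.xlabel(data.feature_names[0])
plt.ylabel(data.feature_names[1])
plt.title("Breast Cancer Classification with SVM (RBF kernel)")
plt.show()
```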

Breast Cancer Classifications with SVM RBF kernel

SVM Tutorial

SVM - Understanding the math - Duality and Lagrange multipliers

This is Part 6 of my series of tutorials about the math behind Support Vector Machines. Today we will learn about duality, optimization problems and Lagrange multipliers.

If you did not read the previous articles, you might want to start the series at the beginning by reading this article: an overview of Support Vector Machines.

 Duality

In mathematical optimization theory, duality means that optimization problems may be viewed from either of two perspectives, the primal problem or the dual problem ( the duality principle ). The solution to the dual problem provides a lower bound to the solution of the primal (minimization) problem.  ( Wikipedia )

The concept of duality is pretty simple to understand if you know what a lower bound is.

What is a lower bound?

Consider the set $S = \{2, 4, 8, 12\}$.

  • Because 1 is less than or equal to 2, 4, 8 and 12, I can say that 1 is a lower bound of S.
  • The same is true for -3, for instance.
  • And even though it is in S, we can also call 2 a lower bound of S.

Moreover, because 2 is larger than any other lower bound, we can give it a special name: we call it the infimum (or greatest lower bound).

So in our example, you can find infinitely many lower bounds, but there is only one infimum.

Note: The same logic applies with the relation "greater than or equal to", and we have the concepts of upper bound and supremum.

Coming back to duality

Now that we know what a lower bound is, what do we understand about the definition of duality? Well, this definition means that if you have a minimization problem, you can also see it as a maximization problem .  And when you find the maximum of this problem, it will be a lower bound to the solution of the minimization problem, i.e. it will always be less than or equal to the minimum of the minimization problem .

Why do we care about duality?

It turns out that sometimes, solving the dual problem is simpler than solving the primal problem. From what we saw about lower bounds, we can see that for some problems solving the dual problem gives us the same result as solving the primal problem!   But when?

Let us look at a visual illustration.

Duality gap

Optimization problems with constraints

An optimization problem is typically written:

$$\begin{aligned} & \text{minimize} && f(x) \\ & \text{subject to} && g_i(x) \le 0, \quad i = 1, \ldots, m \\ & && h_j(x) = 0, \quad j = 1, \ldots, p \end{aligned}$$

This notation is called the standard form. You should know that there are other notations as well.

The value we find MUST respect these constraints!

What does it mean to respect the constraints?

Imagine you try to solve the following optimization problem: minimize $f(x) = x^2$.

When there is no constraint, the minimum is zero.

Equality constraints

With an equality constraint $x = 1$, the minimum is $f(1) = 1$.

Inequality constraint

What if we now use an inequality constraint? It gives us a problem of the form:

Inequality constraint

In mathematical notation we can write it:

Combining constraints

It is possible to add several constraints to an optimization problem. Here is an example with two inequality constraints and its visual representation:

Combining constraints restrict the feasible region

Note that we can also mix equality and inequality constraints together. The only restriction is that if we use contradictory constraints, we can end up with a problem which does not have a feasible set:

What is the solution of an optimization problem?

( Source (page 2) )

How do we find the solution to an optimization problem with constraints?

We will be using the Lagrange Multipliers. It is a method invented by the Italian mathematician, Joseph-Louis Lagrange around 1806.

Joseph Louis Lagrange

Lagrange multipliers

As often, we can find a pretty clear definition on Wikipedia:

In mathematical optimization, the method of Lagrange multipliers is a strategy for finding the local maxima and minima of a function subject to equality constraints. ( Wikipedia )

The critical thing to note from this definition is that the method of Lagrange multipliers only works with equality constraints . So we can use it to solve some optimization problems: those having one or several equality constraints.

But don't worry, the Lagrange multipliers will be the basis used for solving problems with inequality constraints as well, so it is worth understanding this simpler case 🙂

Before explaining what the idea behind Lagrange multipliers is, let us refresh our memory about contour lines.

Contour lines

It is important to understand contour lines to better appreciate Lagrange multipliers.

A contour plot is a way to visualize a three-dimensional function into two dimensions.

contours

Key concepts regarding contour lines :

  • for each point on a line, the function returns the same value
  • the darker the area, the smaller the value of the function

contours_4

Moreover, the gradient  of a function can be visualized as a vector field, with the arrow pointing in the direction where the function is increasing:

gradient_vector_field

It turns out we can easily draw gradient vectors on a contour plot:

  • they are perpendicular to a contour line
  • they should point in the direction where the function is increasing

Here is a representation on two contours plots:

gradient_and_contour

Back to Lagrange multipliers

Let us consider the following optimization problem:


It is interesting to note that we can combine both contour plots to visualize how the two functions behave on one plot. Below you can see the constraint function depicted by blue lines. Also, I drew some gradient vectors of the objective function (in black) and some gradient vectors of the constraint function (in white).

function_and_constraint_3

It means that we want points where:

function_and_constraint_4

In the figure below, we can see the point where the objective function and the feasible set are tangent. I also added some objective function gradient vectors in black and some constraint gradient vectors in white.


We can see that there is only one point where the two vectors point in the same direction: it is the minimum of the objective function, under the constraint.

How does it translate mathematically?

Note that this formula does not require the gradients to have the same direction (multiplying by a negative number will change the direction), but only to be parallel. It is because it can be used to find both maximum and minimum at the same time ( see Example 1 of this article if you want to see how ).

How do we find points for which \nabla f(x,y) = \lambda \nabla g(x,y) ?

To make things a little easier, we notice that if we define a function:

Let us solve this example using the Lagrange multiplier method!

Remember, the problem we wish to solve is:

Step 1: We introduce the Lagrangian function

and its gradient is :

Step 2: We solve for its gradient

which means solving the following system of equations:

Let us verify if it makes sense with our graphical analysis on the graph below:

function_and_constraint_7
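Since the equations of the worked example above are only visible as figures in the original page, here is a hedged stand-in: the same two-step recipe applied to a hypothetical problem, minimizing $f(x,y)=x^2+y^2$ subject to $g(x,y)=x+y-1=0$, solved symbolically with sympy.

```python
import sympy as sp

x, y, lam = sp.symbols("x y lambda", real=True)

f = x**2 + y**2        # hypothetical objective (not the article's exact example)
g = x + y - 1          # hypothetical equality constraint g(x, y) = 0

# Step 1: introduce the Lagrangian function.
L = f - lam * g

# Step 2: solve for the points where its gradient vanishes.
gradient = [sp.diff(L, v) for v in (x, y, lam)]
print(sp.solve(gradient, [x, y, lam], dict=True))   # [{x: 1/2, y: 1/2, lambda: 1}]
```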

In this article, we learned an important concept in optimization: duality. Moreover, we discovered that an optimization problem can have equality and inequality constraints. Finally, we learned what Lagrange multipliers are and how we can use them to solve an optimization problem with one equality constraint.

If you wish to learn more about Lagrange multipliers in the context of SVM, you can read this very good paper showing how to use them with more equality constraints and with inequality constraints.

When will next part be ready?

There is no next part, but instead I wrote a free e-book about Support Vector Machines . That's your next step if you wish to learn more about SVM!

Alexandre KOWALCZYK

I am passionate about machine learning and Support Vector Machine. I like to explain things simply to share my knowledge with people from around the world.

Nonlinear optimization and support vector machines

Veronica Piccialli and Marco Sciandrone

Open access survey. Published 14 April 2022. Volume 314, pages 15–47 (2022).

Support vector machine (SVM) is one of the most important class of machine learning models and algorithms, and has been successfully applied in various fields. Nonlinear optimization plays a crucial role in SVM methodology, both in defining the machine learning models and in designing convergent and efficient algorithms for large-scale training problems. In this paper we present the convex programming problems underlying SVM focusing on supervised binary classification. We analyze the most important and used optimization methods for SVM training problems, and we discuss how the properties of these problems can be incorporated in designing useful algorithms.


1 Introduction

The support vector machine (SVM) is widely used as a simple and efficient tool for linear and nonlinear classification as well as for regression problems. The basic training principle of SVM, motivated by statistical learning theory (Vapnik 1998 ), is that the expected classification error for unseen test samples is minimized, so that SVMs define good predictive models.

In this paper, as done in Piccialli and Sciandrone ( 2018 ), we focus on supervised (linear and nonlinear) binary SVM classifiers, whose task is to classify objects (patterns) into two groups using the features describing the objects and a labelled dataset (the training set ). We will not enter into the details of statistical issues concerning SVM models, nor we will analyze the standard cross-validation techniques used for adjusting SVM hyperparameters in order to optimize the predictive performance as machine learning models. A suitable analysis of statistical and machine learning issues can be found, for instance, in Bishop ( 2006 ), Scholkopf and Smola ( 2001 ) and Shawe-Taylor and Cristianini ( 2004 ). Here we will limit our analysis to theoretical, algorithmic and computational issues related to the optimization problem underlying the training of SVMs.

SVM training requires solving (large-scale) convex programming problems, whose difficulties are mainly related to the possibly huge number of training instances, which leads to a huge number of either variables or constraints. The particular structure of the SVM training problems has favored the design and the development of ad hoc optimization algorithms to solve large-scale problems. Thanks to the convexity of the constrained problem, optimization algorithms for SVM are required to quickly converge towards any minimum. Thus the requirements are well-defined from an optimization point of view, and this has motivated a wide research activity (even of the optimization community) to define efficient and convergent algorithms for SVM training (see, for instance, Astorino and Fuduli 2015 ; Boser et al. 1992 ; Byrd et al. 2011 ; Carrizosa and Romero Morales 2013 ; Cortes and Vapnik 1995 ; Fan et al. 2008 ; Ferris and Munson 2004 ; Franc and Sonnenburg 2009 ; Fung and Mangasarian 2001 ; Gaudioso et al. 2017 ; Hsu and Lin 2002 ; Keerthi and Lin 2003 ; Lee et al. 2015 ; Lee and Mangasarian 2001 ; Mangasarian and Musicant 2001 ; Mangasarian 2006 ; Mavroforakis and Theodoridis 2006 ; Osuna et al. 1997 ; Glasmachers and Dogan 2013 ; Tsang et al. 2005 ; Wang and Lin 2014 ; Wang et al. 2012 ). We observe that in neural network training, where the unconstrained optimization problem is nonconvex and suitable safeguards (for instance, early stopping) must be adopted in order to avoid converging too quickly towards undesired minima (in terms of generalization capabilities), the requirements of a training algorithm are not well-defined from an optimization point of view.

The SVM training problem can be equivalently formulated as a (linearly constrained) quadratic convex problem or, by Wolfe’s duality theory, as a quadratic convex problem with one linear constraint and box constraints. Depending on the formulation, several optimization algorithms have been specifically designed for SVM training. Thus, we present the most important contributions for the primal formulations , i.e., Newton methods, least-squares algorithms, stochastic sub-gradient methods, cutting plane algorithms, and for the dual formulations decomposition methods. Interior point methods were developed both for the primal and the dual formulations. We observe that the design of convergent and efficient decomposition methods for SVM training has yielded relevant advances both from a theoretical and computational point of view. Indeed, the “classical” decomposition methods for nonlinear optimization, such as the successive over-relaxation algorithm and the Jacobi and Gauss-Seidel algorithms, are applicable only when the feasible set is the Cartesian product of subsets defined in smaller subspaces. Since the SVM training problem contains an equality constraint, such methods cannot be directly employed, and this has motivated the study and the design of new decomposition algorithms improving the state-of-art.

The paper is organized as follows. We formally introduce in Sect. 2 the concept of optimal separating hyperplane underlying linear SVM, we give the primal formulation of the linear SVM training problem, and we recall the fundamental concepts of the Wolfe’s dual theory necessary for defining the dual formulation of the linear SVM training problem. The dual formulation allows us, through the so-called kernel trick , to immediately extend in Sect.  3 the approach of linear SVM to the case of nonlinear classifiers. Sections  4 and 5 contain the analysis of unconstrained and constrained methods, respectively, for the primal formulations. The wide class of decomposition methods for the dual formulation is analyzed in Section  6 . Interior point methods are presented in Sect.  7 . Finally, in Sect.  8 we direct the reader to the available software for SVM training related to the presented methods. In the appendices we provide the proofs of important results concerning: (1) the existence and uniqueness of the optimal hyperplane; (2) Wolfe’s dual theory both in the general and in the quadratic case; (3) the kernel functions. As regards (1), although the result is well-known, we believe that the kind of proof is novel and technically interesting. Concerning (2) and (3), they represent pillars of SVM methodology, and a reader might find them of interest to obtain some related technical insights.

2 The optimal separating hyperplane and linear SVM

The training set (TS) is a set of l observations:

\(TS=\{(x^i,y^i),\ x^i\in \Re ^n,\ y^i\in \Re ,\ i=1,\ldots ,l\}.\)

The vectors \(x^i\) are the patterns belonging to the input space. The scalars \(y^i\) are the labels (targets). In a classification problem we have that \(y^i\in \{-1,1\}\) , in a regression problem \(y^i\in \Re \) . We will focus only on classification problems.

Let us consider two disjoint sets A and B of points in \(\Re ^n\) to be classified. Assume that A and B are linearly separable , that is, there exists a hyperplane \(H=\{x\in \Re ^n:~w^Tx+b=0\}\) such that the points \(x^i\in A\) belong to one half-space, and the points \(x^j\in B\) belong to the other half-space. More precisely, we can assume that there exist a vector \(w\in \Re ^n\) and a scalar \(b\in \Re \) such that

\(w^Tx^i+b\ge 1\quad \forall x^i\in A,\qquad w^Tx^j+b\le -1\quad \forall x^j\in B.\)    (1)

A hyperplane will be indicated by H ( w ,  b ). We say that H ( w ,  b ) is a separating hyperplane if the pair ( w ,  b ) is such that ( 1 ) holds. The decision function of a linear classifier associated with a separating hyperplane is \( f_d(x)={{\,\mathrm{sgn}\,}}(w^Tx+b). \) We introduce the concept of margin of a separating hyperplane.

Definition 1

Let H ( w ,  b ) be a separating hyperplane. The margin of H ( w ,  b ) is the minimum distance \(\rho \) between points in \(A\cup B\) and the hyperplane H ( w ,  b ), that is

\(\rho (w,b)=\min _{x^i\in A\cup B}\ \dfrac{|w^Tx^i+b|}{\Vert w\Vert }.\)

It is quite intuitive that the margin of a given separating hyperplane is related to the generalization capability of the corresponding linear classifier, i.e., to correctly classify unseen data. The relationship between the margin and the generalization capability of linear classifiers is analyzed by the statistical learning theory (Vapnik 1998 ), which theoretically motivates the importance of defining the hyperplane with maximum margin , the so-called optimal separating hyperplane .

Definition 2

Given two linearly separable sets A and B , the optimal separating hyperplane is a separating hyperplane \(H(w^\star ,b^\star )\) having maximum margin.

It can be proved that the optimal hyperplane exists and is unique (see “Appendix A”). From the above definition we get that the optimal hyperplane is the unique solution of the following problem

\(\displaystyle \max _{w\in \Re ^n,\,b\in \Re }\ \rho (w,b)\quad \text{s.t. }\ w^Tx^i+b\ge 1\ \ \forall x^i\in A,\quad w^Tx^j+b\le -1\ \ \forall x^j\in B.\)    (2)

It can be proved that problem ( 2 ) is equivalent to the convex quadratic programming problem

\(\displaystyle \min _{w\in \Re ^n,\,b\in \Re }\ \frac{1}{2}\Vert w\Vert ^2\quad \text{s.t. }\ y^i(w^Tx^i+b)\ge 1,\quad i=1,\ldots ,l.\)    (3)

Now assume that the two sets A and B are not linearly separable. This means that the system of linear inequalities ( 1 ) does not admit solution. Let us introduce slack variables \(\xi _i\) , with \(i=1,\ldots ,l\) :
\[ y^i(w^Tx^i+b)\ge 1-\xi _i,\qquad \xi _i\ge 0,\quad i=1,\ldots ,l. \tag{4} \]
Note that whenever a vector \(x^i\) is not correctly classified the corresponding variable \(\xi _i\) is greater than 1. The variables \(\xi _i\) corresponding to vectors correctly classified but belonging to the “separation zone” are such that \(0<\xi _i<1\) . Therefore, the term \(\sum \nolimits _{i=1}^l\xi _i\) is an upper bound on the number of classification errors on the training vectors. Then, it is quite natural to add to the objective function of problem ( 3 ) the term \(C\sum \nolimits _{i=1}^l\xi _i,\) where \(C>0\) is a parameter that weighs the training error. The primal problem becomes
\[ \min _{w,b,\xi }\ \frac{1}{2}\Vert w\Vert ^2+C\sum _{i=1}^l\xi _i \quad \text {s.t. } y^i(w^Tx^i+b)\ge 1-\xi _i,\ \ \xi _i\ge 0,\ \ i=1,\ldots ,l. \tag{5} \]
For reasons explained later, the dual problem of ( 5 ) is often considered. We direct the reader to Bertsekas ( 1999 ), Mangasarian ( 1994 ) and Fletcher ( 1987 ) for insights on duality in nonlinear programming. Let us consider the convex programming problem
\[ \min _{x}\ f(x) \quad \text {s.t. } Ax\le b, \tag{6} \]
where \(f:\Re ^n\rightarrow \Re \) is a convex, continuously differentiable function, \(A\in \Re ^{m\times n}\) , \(b\in \Re ^m\) . Introducing the Lagrangian function \(L(x,\lambda )=f(x)+\lambda ^T(Ax-b)\) , Wolfe’s dual of ( 6 ) is defined as follows
\[ \max _{x,\lambda }\ L(x,\lambda ) \quad \text {s.t. } \nabla _xL(x,\lambda )=0,\ \ \lambda \ge 0. \tag{7} \]
It can be proved (see Appendix B) that, if problem ( 6 ) admits a solution \(x^\star \) , then there exists a vector of Lagrange multipliers \(\lambda ^\star \) such that \((x^\star ,\lambda ^\star )\) is a solution of ( 7 ).
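As a minimal illustration of this construction (a toy example of ours, not taken from the references above), take \(n=m=1\) , \(f(x)=\tfrac{1}{2}x^2\) and the single constraint \(x\ge 1\) , i.e. \(A=-1\) , \(b=-1\) . The Lagrangian is \(L(x,\lambda )=\tfrac{1}{2}x^2+\lambda (1-x)\) , and Wolfe’s dual requires \(\nabla _xL=x-\lambda =0\) and \(\lambda \ge 0\) ; substituting \(x=\lambda \) gives \(\max _{\lambda \ge 0}\ \lambda -\tfrac{1}{2}\lambda ^2\) , whose solution \(\lambda ^\star =1\) yields \(x^\star =1\) , which is indeed the solution of the primal problem \(\min \tfrac{1}{2}x^2\) s.t. \(x\ge 1\) .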

In the general case, given a solution \(({\bar{x}},{\bar{\lambda }})\) of Wolfe’s dual, we cannot draw conclusions with respect to the primal problem ( 6 ). In the particular case of convex quadratic programming problems the following result holds (see “Appendix B”).

Proposition 1

Let \(f(x)={{1}\over {2}}x^TQx+c^Tx\) , and suppose that the matrix Q is symmetric and positive semidefinite. Let \(({\bar{x}},{\bar{\lambda }})\) be a solution of Wolfe’s dual ( 7 ). Then, there exists a vector \(x^\star \) (not necessarily equal to \({\bar{x}}\) ) such that

(i) \(Q(x^\star -{\bar{x}})=0\) ;

(ii) \(x^\star \) is a solution of problem ( 6 ); and

(iii) \(x^\star \) is a global minimum of ( 6 ) with associated multipliers \({\bar{\lambda }}\) .

Now let us consider the convex quadratic programming problem ( 5 ). Here the primal variables are \((w,b,\xi )\) , and the condition \(\nabla _xL(x,\lambda )=0\) gives two constraints

Then, setting \(X=\left[ y^1x^1,\ldots ,y^lx^l\right] \) , \(\lambda ^T=\left[ \lambda ^1,\ldots ,\lambda ^l\right] \) , Wolfe’s dual of ( 5 ) is a convex quadratic programming problem of the form
\[ \min _{\lambda }\ \frac{1}{2}\lambda ^TX^TX\lambda -e^T\lambda \quad \text {s.t. } \sum _{i=1}^l\lambda _iy^i=0,\ \ 0\le \lambda \le Ce, \tag{8} \]
where \(e^T=[1,\ldots ,1]\) .

Once a solution \(\lambda ^\star \) is computed, the primal vector \(w^\star \) can be determined as follows
\[ w^\star =\sum _{i=1}^l\lambda _i^\star y^ix^i, \]
i.e., \(w^\star \) depends only on the so-called support vectors \(x^i\) , whose corresponding multipliers \(\lambda _i^\star \) are not null. The support vectors corresponding to multipliers \(\lambda _i^\star \) such that \(0<\lambda _i^\star <C\) are called free support vectors , those corresponding to multipliers \(\lambda _i^\star =C\) are called bounded support vectors . We also observe that assertion (iii) of Proposition  1 ensures that an optimal solution \((w^\star ,b^\star )\) satisfies the complementarity conditions with multipliers equal to \(\lambda ^\star \) . Thus, by considering any free support vector \(x^i\) , we have \(0<\lambda _i^\star <C\) , which implies
\[ y^i\left( (w^\star )^Tx^i+b^\star \right) =1, \tag{9} \]
so that, once \(w^\star \) is computed, the scalar \(b^\star \) can be determined by means of the corresponding complementarity condition defined by ( 9 ).

Finally, we observe that the decision function of a linear SVM is
\[ f_d(x)={{\,\mathrm{sgn}\,}}\left( (w^\star )^Tx+b^\star \right) ={{\,\mathrm{sgn}\,}}\left( \sum _{i=1}^l\lambda _i^\star y^i(x^i)^Tx+b^\star \right) . \]
Summarizing, duality theory leads to a convenient way to handle the constraints. Moreover, the dual optimization problem, as well as the decision function, can be written in terms of dot products, and this allows us to easily extend the approach to the case of nonlinear classifiers.
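To make the above concrete, the following is a minimal NumPy sketch (our own illustration, with hypothetical names; it is not taken from the cited references) of how an optimal dual vector \(\lambda ^\star \) determines \(w^\star \) , \(b^\star \) and the decision rule. The toy two-point dataset is chosen so that the dual solution can be computed by hand.

```python
import numpy as np

def recover_primal(lambdas, X, y, C):
    """Given dual multipliers for the linear SVM dual, recover (w*, b*).

    X: (l, n) array of training points, y: (l,) array of +/-1 labels.
    w* = sum_i lambda_i y_i x_i; b* from any free support vector (0 < lambda_i < C).
    """
    w = (lambdas * y) @ X                      # w* = sum_i lambda_i y_i x_i
    free = np.where((lambdas > 1e-8) & (lambdas < C - 1e-8))[0]
    i = free[0]                                # any free support vector
    b = y[i] - w @ X[i]                        # complementarity: y_i (w.x_i + b) = 1
    return w, b

def decision(x, w, b):
    return np.sign(w @ x + b)

# Toy dataset: two points, one per class; the dual solution can be worked out
# by hand and is lambdas = (1/2, 1/2) whenever C >= 1/2.
X = np.array([[1.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, -1.0])
lambdas = np.array([0.5, 0.5])                 # hand-computed dual solution (illustrative)
w, b = recover_primal(lambdas, X, y, C=1.0)
print(w, b, decision(np.array([2.0, 3.0]), w, b))   # -> [1. 0.] 0.0 1.0
```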

3 Nonlinear SVM

The idea underlying nonlinear SVM is to map the data from the input space onto a higher dimensional space, called the feature space , and to define a linear classifier in this feature space.

Let us consider a mapping \( \phi :\Re ^n\rightarrow {{\mathcal {H}}} \) where \({{\mathcal {H}}}\) is a Euclidean space (the feature space ) whose dimension is greater than n (and possibly infinite). The input training vectors \(x^i\) are mapped onto \(\phi (x^i)\) , with \(i=1,\ldots ,l\) .

We can thus define a linear SVM in the feature space by replacing \(x^i\) with \(\phi (x^i)\) . Then:

the dual problem ( 8 ) is replaced by the following problem
\[ \min _{\lambda }\ \frac{1}{2}\sum _{i=1}^l\sum _{j=1}^l\lambda _i\lambda _jy^iy^j\phi (x^i)^T\phi (x^j)-\sum _{i=1}^l\lambda _i \quad \text {s.t. } \sum _{i=1}^l\lambda _iy^i=0,\ \ 0\le \lambda \le Ce; \tag{10} \]
the optimal primal vector \(w^\star \) is
\[ w^\star =\sum _{i=1}^l\lambda _i^\star y^i\phi (x^i); \tag{11} \]
given \(w^\star \) and any \(0<\lambda _i^\star < C\) , the scalar \(b^\star \) can be determined using the complementarity conditions
\[ y^i\left( \sum _{j=1}^l\lambda _j^\star y^j\phi (x^j)^T\phi (x^i)+b^\star \right) =1; \]
the decision function takes the form
\[ f_d(x)={{\,\mathrm{sgn}\,}}\left( \sum _{i=1}^l\lambda _i^\star y^i\phi (x^i)^T\phi (x)+b^\star \right) . \tag{12} \]
The primal/dual relation in infinite dimensional spaces has been rigorously discussed in Lin ( 2001a ).

From ( 12 ) we get that the separation surface is:

linear in the feature space ;

nonlinear in the input space .

It is important to observe that both in the dual formulation ( 10 ) and in formula ( 12 ) concerning the decision function it is not necessary to explicitly know the mapping \(\phi \) , but it is sufficient to know the inner product \(\phi (x)^T\phi (z)\) of the feature space. This leads to the fundamental concept of kernel function .

Definition 3

Given a set \(X\subseteq \Re ^n\) , a function
\[ K:X\times X\rightarrow \Re \]
is a kernel if
\[ K(x,z)=\phi (x)^T\phi (z)\quad \forall \, x,z\in X, \tag{13} \]
where \(\phi \) is a mapping \(X\rightarrow {{\mathcal {H}}}\) and \({{\mathcal {H}}}\) is a Euclidean space, that is, a linear space with a fixed inner product.

We observe that a kernel is necessarily a symmetric function. It can be proved that K ( x ,  z ) is a kernel if and only if the \(l\times l\) matrix
\[ \left( K(x^i,x^j)\right) _{i,j=1,\ldots ,l} \]
is positive semidefinite for any set of training vectors \(\{x^1,\ldots ,x^l\}\) . The kernel is often referred to as the Mercer kernel in the literature. We have the following result, whose proof is reported in “Appendix C”.

Proposition 2

Let \(K:X\times X\rightarrow \Re \) be a symmetric function. Then K is a kernel if and only if, for any choice of the vectors \(x^1,\ldots ,x^\ell \) in X , the matrix
\[ \left( K(x^i,x^j)\right) _{i,j=1,\ldots ,\ell } \]
is positive semidefinite.

Using the definition of kernel, problem ( 10 ) can be written as follows
\[ \min _{\lambda }\ \frac{1}{2}\sum _{i=1}^l\sum _{j=1}^l\lambda _i\lambda _jy^iy^jK(x^i,x^j)-\sum _{i=1}^l\lambda _i \quad \text {s.t. } \sum _{i=1}^l\lambda _iy^i=0,\ \ 0\le \lambda \le Ce. \tag{14} \]
By Proposition  2 it follows that problem ( 14 ) is a convex quadratic programming problem. Examples of kernel functions are:

\(K(x,z)=(x^Tz+1)^p\) polynomial kernel ( p integer \(\ge \) 1)

\(K(x,z)=e^{-\Vert x-z\Vert ^2/2\sigma ^2}\) Gaussian kernel ( \(\sigma >0\) )

\(K(x,z)=\tanh (\beta x^Tz+\gamma )\) hyperbolic tangent kernel (for suitable values of \(\beta \) and \(\gamma \) )

It can be shown that the Gaussian kernel is an inner product in an infinite dimensional space. Using the definition of kernel function, the decision function is
\[ f_d(x)={{\,\mathrm{sgn}\,}}\left( \sum _{i=1}^l\lambda _i^\star y^iK(x^i,x)+b^\star \right) . \]
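As a small numerical illustration of Proposition 2 (a sketch of ours, assuming only NumPy), the snippet below builds the Gram matrix of the Gaussian kernel on random points and checks its positive semidefiniteness via its eigenvalues.

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    # K(x, z) = exp(-||x - z||^2 / (2 sigma^2))
    return np.exp(-np.linalg.norm(x - z) ** 2 / (2.0 * sigma ** 2))

def gram_matrix(X, kernel):
    l = X.shape[0]
    return np.array([[kernel(X[i], X[j]) for j in range(l)] for i in range(l)])

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
K = gram_matrix(X, gaussian_kernel)
# Proposition 2: K is a kernel iff every such Gram matrix is positive semidefinite.
print(np.linalg.eigvalsh(K).min() >= -1e-10)   # True for the Gaussian kernel
```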
4 Unconstrained primal formulations

Let us consider the linearly constrained primal formulation ( 5 ) for linear SVM. It can be shown that problem ( 5 ) is equivalent to the following unconstrained nonsmooth problem
\[ \min _{w,b}\ \frac{1}{2}\Vert w\Vert ^2+C\sum _{i=1}^l\max \left\{ 0,1-y^i(w^Tx^i+b)\right\} . \tag{15} \]
The above formulation penalizes the slacks ( \(\xi \) ) linearly and is called \(L_1\) -SVM. An unconstrained smooth formulation is that of the so-called \(L_2\) -SVM, where the slacks are quadratically penalized, i.e.,
\[ \min _{w,b}\ \frac{1}{2}\Vert w\Vert ^2+C\sum _{i=1}^l\max \left\{ 0,1-y^i(w^Tx^i+b)\right\} ^2. \tag{16} \]
Least Squares SVM (LS-SVM) considers the primal formulation ( 5 ), where the inequality constraints
\[ y^i(w^Tx^i+b)\ge 1-\xi _i,\quad i=1,\ldots ,l, \]
are replaced by the equality constraints
\[ y^i(w^Tx^i+b)= 1-\xi _i,\quad i=1,\ldots ,l. \]
This leads to a regularized linear least squares problem
\[ \min _{w,b}\ \frac{1}{2}\Vert w\Vert ^2+C\sum _{i=1}^l\left( 1-y^i(w^Tx^i+b)\right) ^2. \tag{17} \]
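A minimal sketch of how ( 17 ) can be solved in practice, assuming (our own illustrative choice, in the spirit of the bias discussion below) that the bias is handled through an additional constant feature and left unregularized: since \(y^i\in \{-1,1\}\) , the loss equals \((y^i-w^Tx^i-b)^2\) , so the normal equations of a regularized least squares problem can be solved directly.

```python
import numpy as np

def ls_svm_primal(X, y, C):
    """Solve min 1/2||w||^2 + C * sum_i (1 - y_i(w.x_i + b))^2 via normal equations.

    Because y_i in {-1, +1}, the loss equals (y_i - w.x_i - b)^2, so the problem is
    a regularized linear least squares problem (the bias b is left unregularized).
    """
    l, n = X.shape
    Xt = np.hstack([X, np.ones((l, 1))])    # augment with a constant feature for b
    D = np.eye(n + 1)
    D[n, n] = 0.0                           # do not regularize the bias
    A = D + 2.0 * C * Xt.T @ Xt
    rhs = 2.0 * C * Xt.T @ y
    wb = np.linalg.solve(A, rhs)
    return wb[:n], wb[n]

X = np.array([[2.0, 0.0], [1.5, 1.0], [-2.0, 0.0], [-1.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = ls_svm_primal(X, y, C=10.0)
print(np.sign(X @ w + b))                   # expected: [ 1.  1. -1. -1.]
```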
The general unconstrained formulation takes the form
\[ \min _{w,b}\ R(w,b)+C\sum _{i=1}^lL(w,b;x^i,y^i), \tag{18} \]
where R ( w ,  b ) is the regularization term and \(L(w,b;x^i,y^i)\) is the loss function associated with the observation \((x^i,y^i)\) .

We observe that the bias term b plays a crucial role both in the learning model, i.e., it may be critical for successful learning (especially with unbalanced datasets), and in the optimization-based training process. The simplest approach to learn the bias term is to add one more feature to each instance, with constant value equal to 1. In this way, in \(L_1\) -SVM, \(L_2\) -SVM and LS-SVM, the regularization term becomes \(\displaystyle {{{1}\over {2}}(\Vert w\Vert ^2+b^2)}\) , with the advantage of giving the objective function convexity properties useful for the convergence analysis and of making it possible to directly apply algorithms designed for models without the bias term. The conceptual disadvantage of this approach is that the statistical learning theory underlying SVM models is based on an unregularized bias term. We will not go into the details of the issues concerning the bias term.

The extension of the unconstrained approach to nonlinear SVM, where the data \(x^i\) are mapped onto the feature space \({{\mathcal {H}}}\) by the mapping \( \phi :\Re ^n\rightarrow {{\mathcal {H}}} \) , is often carried out by means of the representer theorem (Kimeldorf and Wahba 1970 ). This theorem states that the solution of SVM formulations can be expressed as a linear combination of the mapped training instances. Then, we can train a nonlinear SVM without direct access to the mapped instances, but using only their inner products through the kernel trick. For instance, setting \(w={\sum \nolimits _{i=1}^l}\beta _i\phi (x^i)\) , the optimization problem corresponding to \(L_2\) -SVM with regularized bias term is the following unconstrained problem
\[ \min _{\beta }\ \frac{1}{2}\beta ^TK\beta +C\sum _{i=1}^l\max \left\{ 0,1-y^i\beta ^TK_i\right\} ^2, \tag{19} \]
where K is the kernel matrix associated to the mapping \(\phi \) and \(K_i\) is the \(i-\) th column. Note that both ( 16 ) and ( 19 ) are piecewise convex quadratic functions.

4.1 Methods for primal formulations

First let us consider the nonsmooth formulation ( 15 ) without the bias term b . A simple and effective stochastic sub-gradient descent algorithm has been proposed in Shalev-Shwartz et al. ( 2011 ). The vector w is initially set to 0. At iteration t , a pair \((x^{i_t},y^{i_t})\) is randomly chosen in the training set, and the objective function
\[ f(w)=\frac{\lambda }{2}\Vert w\Vert ^2+\frac{1}{l}\sum _{i=1}^l\max \left\{ 0,1-y^iw^Tx^i\right\} \]
is approximated as follows
\[ f(w;i_t)=\frac{\lambda }{2}\Vert w\Vert ^2+\max \left\{ 0,1-y^{i_t}w^Tx^{i_t}\right\} . \]
The sub-gradient of \(f(w;i_t)\) is
\[ \nabla _t=\lambda w_t-1\left[ y^{i_t}w_t^Tx^{i_t}<1\right] y^{i_t}x^{i_t}, \]
where \(1\left[ y^{i_t}w_t^Tx^{i_t}<1\right] \) is the indicator function which takes the value one if its argument is true and zero otherwise. The vector w is updated as follows
\[ w_{t+1}=w_t-\eta _t\nabla _t, \]
where \(\eta _t={{1}\over {\lambda t}}, \) and \(\lambda >0\) . A more general version of the algorithm is the one based on mini-batch iterations, where instead of using a single example \((x^{i_t},y^{i_t})\) of the training set, a subset of training examples, defined by the set \(A_t\subset \{1,\ldots ,l\}\) , with \(|A_t|=r\) , is considered. The objective function is approximated as follows
\[ f(w;A_t)=\frac{\lambda }{2}\Vert w\Vert ^2+\frac{1}{r}\sum _{i\in A_t}\max \left\{ 0,1-y^iw^Tx^i\right\} , \]
whose sub-gradient is
\[ \nabla _t=\lambda w_t-\frac{1}{r}\sum _{i\in A_t}1\left[ y^iw_t^Tx^i<1\right] y^ix^i. \]
The updating rule is again
\[ w_{t+1}=w_t-\eta _t\nabla _t. \]
In the deterministic case, that is, when all the training examples are used at each iteration, i.e., \(A_t=\{1,\ldots ,l\}\) , the complexity analysis shows that the number of iterations required to obtain an \(\epsilon \) -approximate solution is \( O(1/\lambda \epsilon ). \) In the stochastic case, i.e., \(A_t\subset \{1,\ldots ,l\}\) , a similar result holds in probability. We observe that the complexity analysis relies on the property that the objective function is \(\lambda \) -strongly convex , i.e., the function
\[ f(w)-\frac{\lambda }{2}\Vert w\Vert ^2 \]
is a convex function.
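The following is a compact sketch of the stochastic sub-gradient scheme just described (our own simplified implementation, with hypothetical parameter values and without the optional projection step used in Pegasos).

```python
import numpy as np

def pegasos_sketch(X, y, lam=0.1, T=1000, seed=0):
    """Stochastic sub-gradient descent in the spirit of Pegasos (no bias term).

    At step t a random index i_t is drawn, the objective is approximated by
    (lam/2)||w||^2 + max{0, 1 - y_it w.x_it}, and w moves along a sub-gradient
    with step size eta_t = 1/(lam * t).
    """
    rng = np.random.default_rng(seed)
    l, n = X.shape
    w = np.zeros(n)
    for t in range(1, T + 1):
        i = rng.integers(l)
        eta = 1.0 / (lam * t)
        violated = y[i] * (w @ X[i]) < 1.0
        grad = lam * w - (y[i] * X[i] if violated else 0.0)
        w = w - eta * grad
    return w

X = np.array([[2.0, 0.5], [1.0, 1.5], [-1.5, -0.5], [-2.0, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = pegasos_sketch(X, y)
print(np.sign(X @ w))       # typically recovers the correct labels on this toy set
```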

The extension to nonlinear SVM is performed by taking into account that, once the input data \(x^i\) are mapped onto \(\phi (x^i)\) , thanks to the fact that w is initialized to 0, we can write
\[ w_{t+1}=\frac{1}{\lambda t}\sum _{i=1}^l\alpha _{t+1}[i]\,y^i\phi (x^i), \]
where \(\alpha _{t+1}[i]\) counts how many times example i has been selected so far with a non-zero loss on it. It can be shown that the algorithm does not require explicit access to the weight vector w . To this aim, we show how the vector \(\alpha \) , initialized to zero, is iteratively updated. At iteration t , the index \(i_t\) is randomly chosen in \(\{1,\ldots ,l\}\) ; if
\[ y^{i_t}\,\frac{1}{\lambda t}\sum _{j=1}^l\alpha _{t}[j]\,y^j\phi (x^j)^T\phi (x^{i_t})<1, \]
then set \( \alpha _{t+1}[i_t]=\alpha _{t}[i_t]+1, \) otherwise set \( \alpha _{t+1}[i_t]=\alpha _{t}[i_t]. \) Thus, the algorithm can be implemented by maintaining the vector \(\alpha \) , using only kernel evaluations, without direct access to the feature vectors \(\phi (x)\) .
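A sketch of this kernelized variant (again ours and simplified): only the counter vector \(\alpha \) and the Gram matrix are used, so the feature map \(\phi \) is never accessed explicitly.

```python
import numpy as np

def kernel_pegasos_sketch(K, y, lam=0.1, T=2000, seed=0):
    """Kernelized variant: only the counters alpha[i] are maintained.

    K is the (l, l) Gram matrix K[i, j] = k(x_i, x_j).  At step t, index i_t is drawn
    and alpha[i_t] is incremented iff the kernel margin of x_it is violated, i.e.
    y_it * (1/(lam*t)) * sum_j alpha[j] y_j K[j, i_t] < 1.
    """
    rng = np.random.default_rng(seed)
    l = len(y)
    alpha = np.zeros(l)
    for t in range(1, T + 1):
        i = rng.integers(l)
        margin = y[i] * ((alpha * y) @ K[:, i]) / (lam * t)
        if margin < 1.0:
            alpha[i] += 1.0
    return alpha

X = np.array([[1.5, 0.0], [1.0, 1.0], [-1.0, 0.5], [-1.5, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / 2.0)   # Gaussian Gram matrix
alpha = kernel_pegasos_sketch(K, y)
print(np.sign((alpha * y) @ K))     # in-sample predictions via kernel evaluations only
```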

Newton-type methods for formulation ( 16 ) of \(L_2\) -SVM have been proposed first in Mangasarian ( 2002 ) and then in Keerthi and DeCoste ( 2005 ). The main difficulty of this formulation is that the objective function is not twice continuously differentiable, so that the generalized Hessian must be considered. Finite convergence is proved in both papers. The main peculiarities of the algorithm designed in Keerthi and DeCoste ( 2005 ) are: (1) the formulation of a linear least squares problem for computing the search direction (i.e., the violated constraints, depending on the current solution, are replaced by equality constraints); (2) the adoption of an exact line search for determining the stepsize. The matrix of the least squares problem has a number of rows equal to \(n+n_{v}\) , where \(n_{v}\) is the number of violated inequality constraints , i.e., the constraints such that \(y^iw^Tx^i<1\) .

Newton optimization for problem ( 19 ) and the relationship with the dual formulation have been deeply discussed in Chapelle ( 2007 ). In particular, it is shown that the complexity of one Newton step is \(O(ln_{sv}+n_{sv}^3)\) , where again \(n_{sv}\) is the number of violated inequality constraints , i.e., the constraints such that \(y^i(\beta ^TK_i)<1\) .

In Chang et al. ( 2008 ), the primal unconstrained formulation for linear classification ( 18 ) is considered, with L2 regularization and L2 loss function, i.e., \(f(w)=\frac{1}{2} \Vert w\Vert ^2+C\sum _{i=1}^l \max \{0,1-y^iw^Tx^i\}^2\) . The authors propose a coordinate descent algorithm, where \(w^{k+1}\) is constructed by sequentially updating each component of \(w^k\) . Define
\[ w^{k,i}=\left( w_1^{k+1},\ldots ,w_{i-1}^{k+1},w_i^{k},\ldots ,w_n^{k}\right) ^T,\quad i=2,\ldots ,n, \]
with \(w^{k,1} = w^k\) and \(w^{k,n+1}=w^{k+1}\) . In order to update the i -th component defining \(w^{k,i+1}\) , the following one variable unconstrained problem is approximately solved:
\[ \min _{z\in \Re }\ f\left( w^{k,i}+ze_i\right) , \]
where \(e_i\) denotes the i -th column of the identity matrix.
The obtained function is a piecewise quadratic function, and the problem is solved by means of a line search along the Newton direction computed using the generalized second derivative proposed in Mangasarian ( 2002 ). The authors prove that the algorithm converges to an \(\epsilon \) -accurate solution in \(O\left( nC^3P^6(\#nz)^3\log (\frac{1}{\epsilon })\right) \) iterations, where #nz is the total number of nonzero values in the training data, and \(P=\max _{i,j} |x_j^i|\) .
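The next sketch illustrates the flavor of such a coordinate descent scheme on the \(L_2\) -loss objective (our own simplified version: a plain halving line search along the generalized Newton direction replaces the line search analyzed in the cited papers).

```python
import numpy as np

def l2svm_objective(w, X, y, C):
    margins = np.maximum(0.0, 1.0 - y * (X @ w))
    return 0.5 * w @ w + C * np.sum(margins ** 2)

def coordinate_descent_l2svm(X, y, C=1.0, n_epochs=50):
    """Coordinate descent sketch for f(w) = 1/2||w||^2 + C sum_i max(0, 1 - y_i w.x_i)^2.

    Each coordinate moves along the Newton direction built from the first derivative
    and the generalized second derivative, with a simple halving line search.
    """
    l, n = X.shape
    w = np.zeros(n)
    for _ in range(n_epochs):
        for j in range(n):
            margins = 1.0 - y * (X @ w)
            viol = margins > 0.0                        # violated constraints
            g = w[j] - 2.0 * C * np.sum(y[viol] * X[viol, j] * margins[viol])
            h = 1.0 + 2.0 * C * np.sum(X[viol, j] ** 2)  # generalized second derivative
            d = -g / h
            step, f_old = 1.0, l2svm_objective(w, X, y, C)
            for _ in range(30):                          # halve until f does not increase
                w_try = w.copy()
                w_try[j] += step * d
                if l2svm_objective(w_try, X, y, C) <= f_old:
                    w = w_try
                    break
                step *= 0.5
    return w

X = np.array([[2.0, 0.2], [1.0, 1.0], [-1.0, -0.3], [-2.0, 0.8]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = coordinate_descent_l2svm(X, y, C=1.0)
print(np.sign(X @ w))    # expected: [ 1.  1. -1. -1.]
```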

Finally, standard algorithms for the least squares formulation ( 17 ) concerning LS-SVM have been presented in Suykens and Vandewalle ( 1999 ) and in Cassioli et al. ( 2013 ). In this latter paper an incremental recursive algorithm, which requires storing a square matrix (whose dimension is equal to the number of features of the data), has been employed and could be used, in principle, even for online learning.

5 Constrained primal formulations and cutting plane algorithms

A useful tool in optimization is the cutting plane technique. Depending on the class of problems, it can be used for strengthening a relaxation, for solving a convex problem by means of a sequence of LP relaxations, or for making tractable a problem with an exponential number of constraints.

This type of machinery is applied in Joachims ( 2006 ), Joachims et al. ( 2009 ) and Joachims and Yu ( 2009 ) for training an SVM. The main idea is to reformulate SVM training as a problem with quadratic objective and an exponential number of constraints, but with only one slack variable that measures the overall loss in accuracy in the training set. The constraints are obtained as the combination of all the possible subsets of constraints in problem ( 5 ). Then, a master problem that is the training of a smaller size SVM is solved at each iteration, and the constraint that is most violated in the solution is added for the next iteration.

The advantage is that the number of iterations can be proved to be bounded, and the bound is independent of the size of the problem, depending only on the desired level of accuracy.

More specifically, in Joachims ( 2006 ), the primal formulation ( 5 ) with \(b=0\) is considered, where the error term is divided by the number of elements in the training set, i.e.,
\[ \min _{w,\xi }\ \frac{1}{2}\Vert w\Vert ^2+\frac{C}{l}\sum _{i=1}^l\xi _i \quad \text {s.t. } y^iw^Tx^i\ge 1-\xi _i,\ \ \xi _i\ge 0,\ \ i=1,\ldots ,l. \tag{20} \]
Then, an equivalent formulation called Structural Classification SVM is defined:
\[ \min _{w,\xi \ge 0}\ \frac{1}{2}\Vert w\Vert ^2+C\xi \quad \text {s.t. } \frac{1}{l}\,w^T\sum _{i=1}^lc_iy^ix^i\ge \frac{1}{l}\sum _{i=1}^lc_i-\xi \quad \forall \, {\mathbf {c}}\in \{0,1\}^l. \tag{21} \]
This formulation corresponds to summing up all the possible subsets of the constraints in ( 20 ), and has an exponential number of constraints, one for each vector \({\mathbf{c}}\in \{0,1\}^l\) , but there is only one slack variable \(\xi \) . The two formulations can be shown to be equivalent, in the sense that any solution \(w^*\) of problem ( 21 ) is also a solution of problem ( 20 ), with \(\xi ^*=\frac{1}{l}\sum \xi _i^*\) . The proof relies on the observation that for any value of w the slack variables for the two problems that attain the minimum objective value satisfy \(\xi =\frac{1}{l}\sum \xi _i\) . Indeed, for a given w the smallest feasible slack variables in ( 20 ) are \(\xi _i =\max \{0,1-y^iw^Tx^i\}\) . In a similar way, in ( 21 ) for a given w the smallest feasible slack variable can be found by solving
\[ \xi =\max _{{\mathbf {c}}\in \{0,1\}^l}\left\{ \frac{1}{l}\sum _{i=1}^lc_i-\frac{1}{l}\sum _{i=1}^lc_iy^iw^Tx^i\right\} . \tag{22} \]
However, problem ( 22 ) can be decomposed into l problems, one for each component of the vector \({{\mathbf {c}}}\) , i.e.,
\[ \xi =\frac{1}{l}\sum _{i=1}^l\max _{c_i\in \{0,1\}}c_i\left( 1-y^iw^Tx^i\right) =\frac{1}{l}\sum _{i=1}^l\max \left\{ 0,1-y^iw^Tx^i\right\} , \]
so that the objective values of problems ( 20 ) and ( 21 ) coincide at the optimum. This equivalence result implies that it is possible to solve ( 21 ) instead of ( 20 ).

The advantage of this problem is that there is a single slack variable that is directly related to the infeasibility, since if \((w,\xi )\) satisfies all the constraints with precision \(\epsilon \) , then the point \((w,\xi +\epsilon )\) is feasible. This allows one to establish an effective and straightforward stopping criterion related to the accuracy on the training loss.

The cutting plane algorithm for solving problem ( 21 ) is the following:

Cutting Plane Algorithm

Data. The training set TS, C , \(\epsilon \) .

Initialization. \({{\mathcal {W}}} = \emptyset \) .

Repeat

1. update \((w,\xi )\) with the solution of
\[ \min _{w,\xi \ge 0}\ \frac{1}{2}w^Tw+C\xi \quad \text {s.t. } \frac{1}{l}\,w^T\sum _{i=1}^lc_iy^ix^i\ge \frac{1}{l}\sum _{i=1}^lc_i-\xi \quad \forall \, {\mathbf {c}}\in {{\mathcal {W}}}; \tag{23} \]

2. for \(i=1,\ldots ,l\) set \(c_i=1\) if \(y^iw^Tx^i<1\) , and \(c_i=0\) otherwise;

3. set \({{\mathcal {W}}} = {{\mathcal {W}}} \cup \{{{\mathbf {c}}}\}\) .

Until \(\frac{1}{l} \sum _{i=1}^lc_i-\frac{1}{l} \sum _{i=1}^lc_iy^iw^Tx^i\le \xi +\epsilon \)

Return \((w,\xi )\)

This algorithm starts with an empty set of violated constraints, and then iteratively builds a sufficient subset of the constraints of problem ( 21 ). Step 1 solves the problem with the current set of constraints. The vector \({{\mathbf {c}}}\) computed at Step 2 corresponds to selecting the constraint in ( 21 ) that requires the largest \(\xi \) to make it feasible given the current w , i.e., it finds the most violated constraint. The stopping criterion implies that the algorithm stops when the accuracy on the training loss is considered acceptable. Problem ( 23 ) can be solved either in the primal or in the dual, with any training algorithm for SVM. It can be shown that the algorithm terminates after at most \( \displaystyle \max \left\{ \frac{2}{\epsilon },\frac{8CR^2}{\epsilon ^2}\right\} \) iterations, where \(R=\max _i\Vert x_i\Vert \) , and this number also bounds the size of the working set \({{\mathcal {W}}}\) to a constant that is independent of n and l . Furthermore, for a constant size of the working set \({{\mathcal {W}}}\) , each iteration takes O ( sl ) time, where s is the number of nonzero features for each element of the working set. This algorithm is thus extremely competitive when the problem is highly sparse, and has been extended to handle structural SVM training in Joachims et al. ( 2009 ).

It is also possible to obtain a straightforward extension of this approach to nonlinear kernels by defining a dual version of the algorithm. However, whereas the bound on the number of iterations does not change, the time complexity per iteration worsens significantly, becoming \(O(m^3+ml^2)\) , where m is the number of constraints added in the primal. The idea in Joachims and Yu ( 2009 ) is then to use arbitrary basis vectors to represent the learned rule, not only the support vectors, in order to find sparser solutions and keep the iteration cost lower. In particular, instead of using the representer theorem, setting \(w=\sum _{i=1}^l\alpha _iy^i\phi (x^i)\) and considering the subspace \({{\mathcal {F}}} = \mathrm{span}\{\phi (x^1),\ldots ,\phi (x^l)\}\) , they consider a smaller subspace \({{\mathcal {F}}}^\prime = \mathrm{span}\{\phi (b^1),\ldots ,\phi (b^k)\}\) for some small k , where the basis vectors \(b^i\) are built during the algorithm. In this setting, each iteration has time complexity at most \(O(m^3 +mk^2+kl)\) .
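To illustrate the overall scheme, here is a compact sketch of the cutting plane loop (ours; the restricted master problem is solved with scipy.optimize.minimize and the SLSQP method as a stand-in for the QP solver that would be used in practice).

```python
import numpy as np
from scipy.optimize import minimize

def cutting_plane_svm(X, y, C=1.0, eps=1e-3, max_iter=50):
    """Cutting-plane sketch: one slack variable, constraints added on demand."""
    l, n = X.shape
    W = []                                              # working set of vectors c in {0,1}^l
    w, xi = np.zeros(n), 0.0
    for _ in range(max_iter):
        # most violated constraint for the current w
        c = (y * (X @ w) < 1.0).astype(float)
        violation = c.mean() - (c * y * (X @ w)).mean()
        if violation <= xi + eps:                       # stopping criterion of the algorithm
            break
        W.append(c)
        # master problem: min 1/2||w||^2 + C*xi  s.t. the constraints indexed by W
        def obj(z):
            return 0.5 * z[:n] @ z[:n] + C * z[n]
        cons = [{'type': 'ineq', 'fun': lambda z: z[n]}]          # xi >= 0
        for ck in W:
            g = (ck[:, None] * y[:, None] * X).mean(axis=0)       # (1/l) sum_i c_i y_i x_i
            rhs = ck.mean()
            cons.append({'type': 'ineq',
                         'fun': lambda z, g=g, rhs=rhs: z[:n] @ g - rhs + z[n]})
        res = minimize(obj, np.zeros(n + 1), method='SLSQP', constraints=cons)
        w, xi = res.x[:n], res.x[n]
    return w, xi

X = np.array([[2.0, 0.0], [1.0, 1.0], [-1.0, -0.5], [-2.0, 0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, xi = cutting_plane_svm(X, y, C=10.0)
print(np.sign(X @ w), xi)
```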

Finally, in Teo et al. ( 2010 ) and Le et al. ( 2008 ) a bundle method is defined for regularized risk minimization problems, which is shown to converge in \(O(1/\epsilon )\) steps for linear classification problems; it is further optimized in Franc and Sonnenburg ( 2009 ) and Franc and Sonnenburg ( 2008 ), where an optimized choice of the cutting planes is described.

6 Decomposition algorithms for the dual formulation

Let us consider the convex quadratic programming problem for SVM training in the case of classification problems:
\[ \min _{\alpha }\ \frac{1}{2}\alpha ^TQ\alpha -e^T\alpha \quad \text {s.t. } y^T\alpha =0,\ \ 0\le \alpha \le Ce, \tag{24} \]
where \(\alpha \in \Re ^l\) , l is the number of training data, Q is a \(l\times l\) symmetric and positive semidefinite matrix, \(e\in \Re ^l\) is the vector of ones, \(y\in \{-1,1\}^l\) , and C is a positive scalar. The generic element \(q_{ij}\) of Q is \(y_iy_jK(x^i,x^j)\) , where \(K(x,z)=\phi (x)^{T}\phi (z)\) is the kernel function related to the nonlinear function \(\phi \) that maps the data from the input space into the feature space. We prefer to adopt here the symbol \(\alpha \) (instead of \(\lambda \) as in ( 14 )) for the dual variables, since it is a choice of notation often adopted in the SVM literature.

The structure of problem ( 24 ) is very simple, but we assume that the number l of training data is huge (as in many big data applications) and the Hessian matrix Q, which is dense, cannot be fully stored so that standard methods for quadratic programming cannot be used. Hence, the adopted strategy to solve the SVM problem is usually based on the decomposition of the original problem into a sequence of smaller subproblems obtained by fixing subsets of variables.

We remark that the need to design specific decomposition algorithms, instead of using well-known block coordinate descent methods, arises from the presence of the equality constraint, which, in particular, makes the convergence analysis difficult. The “classical” decomposition methods for nonlinear optimization, such as the successive over-relaxation algorithm and the Jacobi and Gauss–Seidel algorithms (Bertsekas 1999 ), are applicable only when the feasible set is the Cartesian product of subsets defined in smaller subspaces.

In a general decomposition framework, at each iteration k , the vector of variables \(\alpha ^k\) is partitioned into two subvectors \((\alpha _W^k,\alpha ^k_{\overline{W}})\) , where the index set \(W\subset \{1,\ldots ,l\}\) identifies the variables of the subproblem to be solved and is called working set , and \(\overline{W}=\{1,\dots ,l\}\setminus {W}\) (for notational convenience, we omit the dependence on k ).

Starting from the current solution \(\alpha ^k=(\alpha _W^k,\alpha _{\overline{W}}^k)\) , which is a feasible point, the subvector \(\alpha _{W}^{k+1}\) is computed as the solution of the subproblem
\[ \min _{\alpha _W}\ f\left( \alpha _W,\alpha _{\overline{W}}^k\right) \quad \text {s.t. } y_W^T\alpha _W=-y_{\overline{W}}^T\alpha _{\overline{W}}^k,\ \ 0\le \alpha _W\le Ce. \tag{25} \]
The variables corresponding to \(\overline{ W}\) are unchanged, that is, \(\alpha _{\overline{W}}^{k+1}= \alpha _{\overline{W}}^{k} \) , and the current solution is updated setting \(\alpha ^{k+1}=(\alpha _W^{k+1},\alpha _{\overline{W}}^{k+1})\) . The general framework of a decomposition scheme is reported below.

Decomposition framework

Data. A feasible point \(\alpha ^0\) (usually \(\alpha ^0=0\) ).

Initialization. Set \(k=0\) .

While the stopping criterion is not satisfied

select the working set \({W}^k\) ;

set \(W=W^k\) and compute a solution \(\alpha ^*_{{W}}\) of the problem ( 25 );

set \(\alpha ^{k+1}_i= \begin{cases}\alpha ^*_i & \text {for } i\in W\\ \alpha ^k_i & \text {otherwise;}\end{cases}\)

set \( \nabla f(\alpha ^{k+1})= \nabla f(\alpha ^k)+Q\left( \alpha ^{k+1}-\alpha ^k\right) .\)

set \(k=k+1\) .

Return \(\alpha ^*=\alpha ^k\)

The choice \(\alpha ^0=0\) for the starting point is motivated by the fact that this point is a feasible point and such that the computation of the gradient \(\nabla f(\alpha ^0)\) does not require any element of the matrix Q , being \( \nabla f(0)=-e. \) The cardinality q of the working set, namely the dimension of the subproblem, must be greater than or equal to 2, due to the presence of the linear constraint, otherwise we would have \(\alpha ^{k+1}=\alpha ^k\) .

The selection rule of the working set strongly affects both the speed of the algorithm and its convergence properties. In computational terms, the most expensive step at each iteration of a decomposition method is the evaluation of the kernel to compute the columns of the Hessian matrix, corresponding to the indices in the working set W . These columns are needed for updating the gradient.

We distinguish between:

Sequential minimal optimization (SMO) algorithms, where the size of the working set is exactly equal to two; and

General decomposition algorithms , where the size of the working set is strictly greater than two.

In the sequel we will mainly focus on SMO algorithms, since they are the most used algorithms to solve large quadratic programs for SVM training.

6.1 Sequential minimal optimization (SMO) algorithms

The decomposition methods usually adopted are the so-called “sequential minimal optimization” (SMO) algorithms, since at each iteration they update the minimum number of variables, that is two. At each iteration, an SMO algorithm requires the solution of a convex quadratic programming of two variables with one linear equality constraint and box constraints. Note that the solution of a subproblem in two variables of the above form can be analytically determined (and this is one of the reasons motivating the interest in defining SMO algorithms). SMO algorithms were the first methods proposed for SVM training and the related literature is wide (see, e.g., Joachims 1999 ; Keerthi and Gilbert 2002 ; Lin 2001b ; Osuna et al. 1997 ; Platt 1999 ).

The analysis of SMO algorithms relies on feasible and descent directions having only two nonzero elements. In order to characterize these directions, given a feasible point \(\bar{\alpha }\) , let us introduce the following index sets
\[ R(\bar{\alpha })=\left\{ i:\ \bar{\alpha }_i<C,\ y^i=1\right\} \cup \left\{ i:\ \bar{\alpha }_i>0,\ y^i=-1\right\} ,\qquad S(\bar{\alpha })=\left\{ i:\ \bar{\alpha }_i<C,\ y^i=-1\right\} \cup \left\{ i:\ \bar{\alpha }_i>0,\ y^i=1\right\} . \tag{26} \]
The introduction of the index sets \(R(\alpha )\) and \(S(\alpha )\) allows us to state the optimality conditions in the following form (see, e.g., Lucidi et al. 2007 ).

Proposition 3

A feasible point \(\alpha ^*\) is a solution of ( 24 ) if and only if
\[ \max _{i\in R(\alpha ^*)}\left\{ -y^i\left( \nabla f(\alpha ^*)\right) _i\right\} \le \min _{j\in S(\alpha ^*)}\left\{ -y^j\left( \nabla f(\alpha ^*)\right) _j\right\} . \tag{27} \]
Given a feasible point \({\bar{\alpha }}\) , which is not a solution of problem ( 24 ), a pair \(i\in R({\bar{\alpha }})\) , \(j\in S({\bar{\alpha }})\) such that
\[ -y^i\left( \nabla f({\bar{\alpha }})\right) _i>-y^j\left( \nabla f({\bar{\alpha }})\right) _j \]
is said to be a violating pair .

Given a violating pair ( i ,  j ), let us consider the direction \(d^{i,j}\) with two nonzero elements defined as follows
\[ d_h^{i,j}=\begin{cases} y^i & \text {if } h=i,\\ -y^j & \text {if } h=j,\\ 0 & \text {otherwise.}\end{cases} \]
It can be easily shown that \(d^{i,j}\) is a feasible direction at \({\bar{\alpha }}\) and that \( \nabla f({\bar{\alpha }})^Td^{i,j}<0, \) i.e., \(d^{i,j}\) is a descent direction. Hence, selecting a violating pair in an SMO-type algorithm yields a strict decrease of the objective function. However, the use of generic violating pairs as working sets is not sufficient to guarantee convergence properties of the sequence generated by a decomposition algorithm.

A convergent SMO algorithm can be defined using as indices of the working set those corresponding to the “maximal violation” of the KKT conditions. More specifically, given again a feasible point \(\alpha \) which is not a solution of problem ( 24 ), let us define
\[ I(\alpha )=\left\{ i\in R(\alpha ):\ -y^i\left( \nabla f(\alpha )\right) _i=\max _{h\in R(\alpha )}\left\{ -y^h\left( \nabla f(\alpha )\right) _h\right\} \right\} ,\qquad J(\alpha )=\left\{ j\in S(\alpha ):\ -y^j\left( \nabla f(\alpha )\right) _j=\min _{h\in S(\alpha )}\left\{ -y^h\left( \nabla f(\alpha )\right) _h\right\} \right\} . \]
Taking into account the KKT conditions as stated in ( 27 ), a pair \(i\in I(\alpha )\) , \(j\in J(\alpha )\) most violates the optimality conditions, and therefore, it is said to be a maximal violating pair . Note that the selection of the maximal violating pair involves O ( l ) operations. An SMO-type algorithm using maximal violating pairs as working sets is usually called most violating pair (MVP) algorithm which is formally described below.

SMO-MVP Algorithm

Data. The starting point \(\alpha ^0=0\) and the gradient \(\nabla f(\alpha ^0)=-e\) .

Initialization. Set \(k=0\) .

While the stopping criterion is not satisfied

select \(i\in I(\alpha ^k)\) , \( j\in J(\alpha ^k)\) , and set \(W=\{i,j\}\) ;

compute analytically a solution \(\alpha ^*=\left( \alpha _i^\star \quad \alpha _j^\star \right) ^T \) of ( 25 );

set \(\alpha ^{k+1}_h=\begin{cases}\alpha ^*_i & \text {for } h=i\\ \alpha ^*_j & \text {for } h=j\\ \alpha ^k_h & \text {otherwise;}\end{cases}\)

set \(\nabla f(\alpha ^{k+1})=\nabla f(\alpha ^k)+(\alpha _i^{k+1}-\alpha _i^k)Q_i+(\alpha _j^{k+1}-\alpha _j^k)Q_j\) ;

set \(k=k+1\) .

Return \(\alpha ^*=\alpha ^k\)

The scheme requires storing a vector of size l (the gradient \(\nabla f(\alpha ^k)\) ) and getting two columns, \(Q_i\) and \(Q_j\) , of the matrix Q .

We remark that the condition on the working set selection rule, i.e., \(i\in I(\alpha ^k)\) , \( j\in J(\alpha ^k)\) , can be viewed as a Gauss–Southwell rule, since it is based on the maximum violation of the optimality conditions. It can be proved (see Lin 2001b , 2002a ) that the SMO-MVP Algorithm is globally convergent provided that the Hessian matrix Q is positive semidefinite.
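The following sketch (our own didactic implementation, not the one of the cited software) puts together the ingredients described so far: the index sets R and S, the most violating pair, the analytic solution of the two-variable subproblem along the direction \(d^{i,j}\) , and the cheap gradient update; it also uses the stopping criterion \(m(\alpha )\le M(\alpha )+\epsilon \) discussed later in this section.

```python
import numpy as np

def smo_mvp(Q, y, C, eps=1e-3, max_iter=10000):
    """SMO sketch with most-violating-pair selection for
    min 1/2 a'Qa - e'a  s.t.  y'a = 0, 0 <= a <= C."""
    l = len(y)
    alpha = np.zeros(l)
    grad = -np.ones(l)                                    # gradient at alpha = 0 is -e
    for _ in range(max_iter):
        idx_R = np.where(((alpha < C - 1e-12) & (y > 0)) | ((alpha > 1e-12) & (y < 0)))[0]
        idx_S = np.where(((alpha < C - 1e-12) & (y < 0)) | ((alpha > 1e-12) & (y > 0)))[0]
        scores = -y * grad
        i = idx_R[np.argmax(scores[idx_R])]               # most violating index in R
        j = idx_S[np.argmin(scores[idx_S])]               # most violating index in S
        if scores[i] <= scores[j] + eps:                  # m(alpha) <= M(alpha) + eps
            break
        # analytic step along d with d_i = y_i, d_j = -y_j, zero elsewhere
        d_grad = y[i] * grad[i] - y[j] * grad[j]          # grad' d < 0 for a violating pair
        d_Qd = Q[i, i] + Q[j, j] - 2.0 * y[i] * y[j] * Q[i, j]
        t = -d_grad / max(d_Qd, 1e-12)                    # unconstrained minimizer
        # clip t so that 0 <= alpha_i + t*y_i <= C and 0 <= alpha_j - t*y_j <= C
        t = min(t, (C - alpha[i]) if y[i] > 0 else alpha[i])
        t = min(t, alpha[j] if y[j] > 0 else (C - alpha[j]))
        da_i, da_j = t * y[i], -t * y[j]
        alpha[i] += da_i
        alpha[j] += da_j
        grad += da_i * Q[:, i] + da_j * Q[:, j]           # cheap gradient update
    return alpha

X = np.array([[2.0, 0.0], [1.0, 1.0], [-1.0, 0.5], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
Q = (y[:, None] * y[None, :]) * (X @ X.T)                 # q_ij = y_i y_j x_i.x_j (linear kernel)
alpha = smo_mvp(Q, y, C=1.0)
w = (alpha * y) @ X                                       # w* = sum_i alpha_i y_i x_i
free = np.where((alpha > 1e-6) & (alpha < 1.0 - 1e-6))[0]
b = y[free[0]] - w @ X[free[0]]                           # complementarity at a free support vector
print(np.sign(X @ w + b))                                 # expected: [ 1.  1. -1. -1.]
```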

A usual requirement to establish convergence properties in the context of a decomposition strategy is that
\[ \lim _{k\rightarrow \infty }\left\Vert \alpha ^{k+1}-\alpha ^k\right\Vert =0. \tag{28} \]
Indeed, in a decomposition method, at the end of each iteration k , only the satisfaction of the optimality conditions with respect to the variables associated to W is ensured. Therefore, to get convergence towards KKT points, it may be necessary to ensure that consecutive points, which are solutions of the corresponding subproblems, tend to the same limit point.

It can be proved (Lin 2002a ) that SMO algorithms guarantee property ( 28 ) (the proof fully exploits that the subproblems are convex, quadratic problems into two variables).

The global convergence result of SMO algorithms can be obtained even using working set rules different from the one selecting the maximal violating pair. For instance, the so-called constant-factor violating pair rule (Chen et al. 2006 ) guarantees global convergence properties of the SMO algorithm adopting it, and requires selecting any violating pair \(u\in R(\alpha ^k)\) , \(v\in S(\alpha ^k)\) such that
\[ -y^u\left( \nabla f(\alpha ^k)\right) _u+y^v\left( \nabla f(\alpha ^k)\right) _v\ge \sigma \left( -y^i\left( \nabla f(\alpha ^k)\right) _i+y^j\left( \nabla f(\alpha ^k)\right) _j\right) , \tag{29} \]
where \(0<\sigma \le 1\) and ( i ,  j ) is a maximal violating pair.

The SMO-MVP algorithm is globally convergent and is based on first order information, since the maximal violating pair is related to the minimization of the first order approximation:
\[ f(\alpha ^k+d)\approx f(\alpha ^k)+\nabla f(\alpha ^k)^Td. \]
An SMO algorithm using second order information has been proposed in Fan et al. ( 2005 ), where the designed working set selection rule takes into account that f is quadratic and we can write
\[ f(\alpha ^k+d)=f(\alpha ^k)+\nabla f(\alpha ^k)^Td+\frac{1}{2}d^TQd. \tag{30} \]
In particular, the working set selection rule of Fan et al. ( 2005 ) exploits second order information using ( 30 ), requires O ( l ) operations, and provides a pair defining the working set which is a constant-factor violating pair. Then, the resulting SMO algorithm, based on second order information, is globally convergent. Other convergent SMO algorithms, not based on the MVP selection rule, have been proposed in Chang et al. ( 2000 ), Lin et al. ( 2009 ) and Lucidi et al. ( 2007 ).

We conclude the analysis of SMO algorithms by focusing on the stopping criterion. To this aim let us introduce the functions \(m(\alpha )\) , \(M(\alpha )\) :
\[ m(\alpha )=\max _{i\in R(\alpha )}\left\{ -y^i\left( \nabla f(\alpha )\right) _i\right\} ,\qquad M(\alpha )=\min _{j\in S(\alpha )}\left\{ -y^j\left( \nabla f(\alpha )\right) _j\right\} , \]
where \(R(\alpha )\) and \(S(\alpha )\) are the index sets previously defined. From the definitions of \(m(\alpha )\) and \(M(\alpha )\) , and using Proposition  3 , it follows that \(\bar{\alpha }\) is a solution of ( 24 ) if and only if \( m(\bar{\alpha }) \le M({\bar{\alpha }}) \) .

Let us consider a sequence of feasible points \(\{\alpha ^k\}\) converging to a solution \({\bar{\alpha }}\) . At each iteration k , if \(\alpha ^k\) is not a solution then (using again Proposition 3 ) we have \( m(\alpha ^k) > M(\alpha ^k) \) .

Therefore, one of the adopted stopping criteria is
\[ m(\alpha ^k)\le M(\alpha ^k)+\epsilon , \tag{31} \]
where \(\epsilon >0 \) .

Note that the functions \(m(\alpha )\) and \(M(\alpha )\) are not continuous. Indeed, even assuming \(\alpha ^k\rightarrow {\bar{\alpha }}\) for \(k\rightarrow \infty \) , it may happen that \(R(\alpha ^k)\ne R({\bar{\alpha }})\) or \(S(\alpha ^k)\ne S({\bar{\alpha }})\) for k sufficiently large. However, it can be proved (Lin 2002b ) that an SMO Algorithm using the constant-factor violating pair rule generates a sequence \(\{\alpha ^k\}\) such that \(m(\alpha ^k)-M(\alpha ^k)\rightarrow 0\) for \(k\rightarrow \infty \) . Hence, for any \(\epsilon >0\) , an SMO algorithm of this type satisfies the stopping criterion ( 31 ) in a finite number of iterations. To our knowledge, this finite convergence result has not been proved for other asymptotically convergent SMO algorithms not based on the constant-factor violating pair rule.

6.2 General decomposition algorithms

In this section we briefly present decomposition algorithms using working sets of size greater than two. To this aim we will refer to the decomposition framework previously defined. The distinguishing features of the decomposition algorithms are:

(a) the working set selection rule; and

(b) the iterative method used to solve the quadratic programming subproblem.

The dimension of the subproblems is usually on the order of ten variables. A working set selection rule, based on the violation of the optimality conditions of Proposition  3 , has been proposed in Joachims ( 1999 ) and analyzed in Lin ( 2001b ). The rule includes, as particular case, the one selecting the most violating pair and used by SMO-MVP algorithm. Let \(q\ge 2\) be an even integer defining the size of the working set W . The working set selection rule is the following.

Select q /2 indices in \(R(\alpha ^k)\) sequentially so that
\[ -y^{i_1}\left( \nabla f(\alpha ^k)\right) _{i_1}\ge -y^{i_2}\left( \nabla f(\alpha ^k)\right) _{i_2}\ge \cdots \ge -y^{i_{q/2}}\left( \nabla f(\alpha ^k)\right) _{i_{q/2}}, \]
with \(i_1\in I(\alpha ^k)\) .

Select q /2 indices in \(S(\alpha ^k)\) sequentially so that
\[ -y^{j_1}\left( \nabla f(\alpha ^k)\right) _{j_1}\le -y^{j_2}\left( \nabla f(\alpha ^k)\right) _{j_2}\le \cdots \le -y^{j_{q/2}}\left( \nabla f(\alpha ^k)\right) _{j_{q/2}}, \]
with \(j_1\in J(\alpha ^k)\) .

Set \(W = \{i_1,i_2,\ldots ,i_{q/2},j_1,j_2,\ldots , j_{q/2} \}\) .

Note that the working set rule employed by the SMO-MVP algorithm is a particular case of the above rule, with \(q=2\) . The asymptotic convergence of the decomposition algorithm based on the above working set rule and on the computation of the exact solution of the subproblem has been established in Lin ( 2001b ) under the assumption that the objective function is strictly convex with respect to block components of cardinality less than or equal to q . This assumption is used to guarantee condition ( 28 ), but it may not hold, for instance, if some training data are the same. A proximal point-based modification of the subproblem has been proposed in Palagi and Sciandrone ( 2005 ), and the global convergence of the decomposition algorithm using the above working set selection rule has been proved without strict convexity assumptions on the objective function.

We observe that the above working set selection rule (see (i–ii)) requires considering subproblem variables that mostly violate (in a decreasing order) the optimality conditions. This guarantees global convergence, but the degree of freedom for selecting the whole working set is limited. An open theoretical question concerns the convergence of a decomposition algorithm where the working set, besides the most violating pair, includes other arbitrary indices. This issue is very important to exploit the use of a caching technique that allocates some memory (the cache) to store the recently used columns of the Hessian matrix, thus avoiding in some cases the recomputation of these columns. To minimize the number of kernel evaluations and to reduce the computational time, it is convenient to select working sets containing as many elements corresponding to columns stored in the cache memory as possible. However, to guarantee the global convergence of a decomposition method, the working set selection cannot be completely arbitrary. The study of decomposition methods specifically designed to couple both the theoretical aspects of convergence and an efficient use of a caching strategy has motivated some works (see, e.g., Glasmachers and Igel 2006 ; Lin et al. 2009 ; Lucidi et al. 2009 ).

Concerning point (b), we observe that a closed form of the solution of the subproblem whose dimension is greater than two is not available, and this motivates the need to adopt an iterative method. In Joachims ( 1999 ) a primal-dual interior-point solver is used to solve the quadratic programming subproblems.

Gradient projection methods are suitable methods since they consist in a sequence of projections onto the feasible region that are inexpensive due to the special structure of the feasible set of ( 25 ). In fact, a projection onto the feasible set can be performed by efficient algorithms like those proposed in Dai and Fletcher ( 2006 ), Kiwiel ( 2008 ), and Pardalos and Kovoor ( 1990 ). Gradient projection methods for SVM have been proposed in Dai and Fletcher ( 2006 ) and Serafini and Zanni ( 2005 ).

Finally, the approach proposed in Mangasarian and Musicant ( 1999 ), where the square of the bias term is added to the objective function, leads by the Wolfe dual to a quadratic programming problem with only box constraints, called Bound-constrained SVM formulation (BSVM). In Hsu and Lin ( 2002 ), this simpler formulation has been considered, suitable working set selection rules have been defined, and the software TRON (Lin and Morè 1999 ), designed for large sparse bound-constrained problems, has been adapted to solve small (say of dimension 10) fully dense subproblems.

In Hsieh et al. ( 2008 ), by exploiting the bound-constrained formulation for the specific class of linear SVM, a dual coordinate descent algorithm has been defined where the dual variables are updated one at a time. The subproblem is solved analytically, the algorithm converges with at least linear rate, and it obtains an \(\epsilon \) -accurate solution in \(O(\log (1/\epsilon ))\) iterations. A parallel version has been defined in Chiang et al. ( 2016 ). Also, in Glasmachers and Dogan ( 2013 ) an adaptive coordinate selection has been introduced that does not select all coordinates equally often for optimization; instead, the relative frequencies of the coordinates are subject to online adaptation, leading to a significant speedup.
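A minimal sketch of this dual coordinate descent idea for the linear kernel (ours and simplified: the bias is absorbed into a constant feature, and the random permutations and shrinking used in practical implementations are omitted).

```python
import numpy as np

def dual_cd_linear_svm(X, y, C=1.0, n_epochs=20):
    """Dual coordinate descent sketch for the bound-constrained linear SVM dual
    (no equality constraint; the bias is absorbed into the features).
    Each alpha_i is updated in closed form while w = sum_j alpha_j y_j x_j is maintained.
    """
    l, n = X.shape
    alpha = np.zeros(l)
    w = np.zeros(n)
    Qii = (X ** 2).sum(axis=1)                 # diagonal of Q for the linear kernel
    for _ in range(n_epochs):
        for i in range(l):
            G = y[i] * (w @ X[i]) - 1.0        # partial derivative w.r.t. alpha_i
            new_ai = min(max(alpha[i] - G / Qii[i], 0.0), C)
            w += (new_ai - alpha[i]) * y[i] * X[i]
            alpha[i] = new_ai
    return w, alpha

X = np.array([[2.0, 0.0, 1.0], [1.0, 1.0, 1.0],      # last column = constant feature,
              [-1.0, 0.5, 1.0], [-2.0, -1.0, 1.0]])  # so the bias is part of w
y = np.array([1.0, 1.0, -1.0, -1.0])
w, alpha = dual_cd_linear_svm(X, y, C=1.0)
print(np.sign(X @ w))     # expected: [ 1.  1. -1. -1.]
```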

7 Interior point methods

Interior point methods are a valuable option for solving convex quadratic optimization problems of the form
\[ \min _{z}\ \frac{1}{2}z^THz+c^Tz \quad \text {s.t. } Az=b,\quad 0\le z\le u. \]
Primal-dual interior point methods consider at each step a perturbed version of the (necessary and sufficient) primal dual optimality conditions,

where \(S = \mathrm Diag(s), V = \mathrm Diag(v), Z = \mathrm Diag(z), U = \mathrm Diag(u)\) , and solve this system by applying the Newton method, i.e., compute the search direction \((\varDelta z,\varDelta \lambda ,\varDelta s,\varDelta v)\) by solving:

for suitable residuals. The variables \( \varDelta s\) and \(\varDelta v\) can be eliminated, obtaining the augmented system:
\[ \begin{pmatrix} -(H+\varTheta ^{-1}) & A^T\\ A & 0 \end{pmatrix} \begin{pmatrix} \varDelta z\\ \varDelta \lambda \end{pmatrix} = \begin{pmatrix} r_c\\ r_b \end{pmatrix}, \tag{38} \]
where \(\varTheta \equiv Z^{-1}S+(U-Z)^{-1}V\) and \(r_c\) and \(r_b\) are suitable residuals. Finally \(\varDelta z\) is eliminated, ending up with the normal equations, which require calculating
\[ M=A\left( H+\varTheta ^{-1}\right) ^{-1}A^T \tag{39} \]
and factorizing it to solve \(M\varDelta \lambda = -{\hat{r}}_b\) .

The advantage of interior point methods is that the number of iterations is almost independent of the size of the problem, whereas the main computational burden at each iteration is the solution of system ( 38 ). IPMs have been applied to linear SVM training in Ferris and Munson ( 2002 ), Fine and Scheinberg ( 2001 ), Gertz and Griffin ( 2010 ), Goldfarb and Scheinberg ( 2008 ), Woodsend and Gondzio ( 2009 ) and Woodsend and Gondzio ( 2011 ). The main differences are the formulations of the problem considered and the linear algebra tools used in order to solve the corresponding system ( 38 ).

In Ferris and Munson ( 2002 ), different versions of the primal-dual pair for SVM are considered: the standard one, given by ( 5 ) and ( 8 ), and one where the bias term is included in the objective function:

with corresponding dual

The simplest situation for IPMs is problem ( 43 ), where the linear system ( 38 ) simplifies into

with \( C=\frac{1}{2C}I+\varTheta ^{-1}\) , \(H=I\) and \(R = Y[X^T -e]\) . The matrix \(C+RR^T\) can be easily inverted using the Sherman-Morrison-Woodbury formula (Golub and Van Loan 1996 ):
\[ \left( C+RHR^T\right) ^{-1}=C^{-1}-C^{-1}R\left( H^{-1}+R^TC^{-1}R\right) ^{-1}R^TC^{-1}, \]
where \(C^{-1}\) and \(H^{-1}\) are diagonal and positive definite and the matrix \( H^{-1}+R^TC^{-1}R\) is of size n and needs to be computed only once per iteration. The approach can be extended by using some (slightly more complex) variations of this formula for solving ( 41 ), whereas for solving problem ( 8 ) some proximal point is needed.
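The following NumPy sketch (ours) shows the kind of saving the Sherman-Morrison-Woodbury formula provides: a system with the large \(l\times l\) matrix \(C+RR^T\) is solved by factorizing only a small matrix whose size is the number of columns of R.

```python
import numpy as np

def solve_smw(c_diag, R, rhs):
    """Solve (C + R R^T) x = rhs with C = diag(c_diag) via Sherman-Morrison-Woodbury:
    only a small system of size R.shape[1] is factorized (here H = I)."""
    Cinv_rhs = rhs / c_diag
    Cinv_R = R / c_diag[:, None]
    small = np.eye(R.shape[1]) + R.T @ Cinv_R          # H^{-1} + R^T C^{-1} R with H = I
    return Cinv_rhs - Cinv_R @ np.linalg.solve(small, R.T @ Cinv_rhs)

rng = np.random.default_rng(1)
l, n = 500, 5
R = rng.normal(size=(l, n + 1))
c_diag = rng.uniform(0.5, 2.0, size=l)
rhs = rng.normal(size=l)
x = solve_smw(c_diag, R, rhs)
print(np.allclose((np.diag(c_diag) + R @ R.T) @ x, rhs))    # True
```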

In Gertz and Griffin ( 2010 ), an interior point method is defined for solving the primal problem ( 5 ). In this case, we consider the dual variables \(\alpha \) associated to the classification constraints with the corresponding slack variables s , and \(\mu \) the multipliers associated to the nonnegativity constraints on the \(\xi \) vector. In this case, the primal dual optimality conditions lead to the following reduced system:

where \(\varOmega = \mathrm Diag(\alpha )^{-1}S+\mathrm Diag(\mu )^{-1}\mathrm Diag(\xi ) \) . By row elimination, system ( 46 ) can be transformed into

Finally, this system can be reduced into

where \(y_d=X^TY\varOmega ^{-1}y\) . The main cost in solving this system is computing and factorizing the matrix \(I+X^TY\varOmega ^{-1}YX-\frac{1}{y^T\varOmega ^{-1}y}y_dy_d^T\) . In Gertz and Griffin ( 2010 ), the idea is to solve system ( 48 ) by a preconditioned linear conjugate gradient, which requires only a mechanism for computing matrix-vector products of the form Mx , and a new preconditioner is defined exploiting the structure of problem ( 5 ). The method is applicable when the number of features n is relatively large. Both the methods proposed in Ferris and Munson ( 2002 ) and Gertz and Griffin ( 2010 ) exploit the Sherman-Morrison-Woodbury formula, but it has been shown (see Goldfarb and Scheinberg 2008 ) that this approach leads to numerical issues, especially when the matrix \(\varTheta \) (or \(\varOmega \) ) is ill-conditioned and when there is near degeneracy in the matrix XY , which occurs if there are multiple samples close to the separating hyperplane.

In Goldfarb and Scheinberg ( 2008 ), an alternative approach is proposed for solving problem ( 8 ) (note that in this section we stick to the notation \(\alpha \) instead of \(\lambda )\) based on Product Form Cholesky Factorization. Here it is assumed that the matrix \(YX^TXY\) can be approximated by a low rank matrix \(VV^T\) , and an efficient and numerically stable Cholesky Factorization of the matrix \(VV^T+ \mathrm Diag(\alpha )^{-1}S+(\mathrm Diag(Ce)-\mathrm Diag(\alpha ))^{-1}Z\) is computed. The advantage with respect to methods using the SMW formula is that the \(LDL^T\) Cholesky factorization of the IPM normal equation matrix enjoys the property that the matrix L is stable even if D becomes ill-conditioned.

A different approach to overcoming the numerical issues related to the SMW formula is the one described in Woodsend and Gondzio ( 2011 ), where a primal-dual interior point method is proposed based on a different formulation of the training problem. In particular, the authors consider the dual formulation ( 8 ), and include the substitution
\[ w=\sum _{i=1}^l\alpha _iy^ix^i \]
in order to get the following primal-dual formulation:

In order to apply standard interior point methods, that require all the variables to be bounded, some bounds are added on the variable w , so that the problem to be solved becomes:

The advantage of this formulation is that the objective function matrix Q is sparse, since it only has a non zero diagonal block corresponding to w (that is the identity matrix). Specializing the matrix M in ( 39 ) for this specific problem, if we define

Building the matrix M is the most expensive operation, of order \(\mathcal {O}(l(n+1)^2)\) while inverting the matrix is of order \(\mathcal {O}((n+1)^3)\) . In order to get the optimal hyperplane, it is possible to directly get the bias b since it is the element of \(\lambda \) corresponding to the constraint \(y^T\alpha = 0\) .

The method uses as stopping condition the stability of the set of support vectors monitored by measuring the change in the angle \(\phi \) of the normal to the hyperplane between iterations i and \(i-1\) :

Furthermore the number of iterations of IPMs can be reduced by using multiple correctors (that all use the same factorization of M ) to improve the centrality of the current point, and also an accurate estimate of the bounds on w can help to speed up the approach.

A parallel version of this algorithm has been introduced in Woodsend and Gondzio ( 2009 ).

8 Available software for SVM training

Most of the methods described in this survey are implemented in open source software that can be freely downloaded. Here we report, for the reader’s convenience, a list of the algorithms and the links to the corresponding software.

Algorithms for solving SVM in the primal:

The stochastic sub-gradient methods described in Shalev-Shwartz et al. ( 2011 ) are implemented in the software PEGASOS, which can be downloaded from https://www.cs.huji.ac.il/~shais/code/index.html

The cutting plane algorithm proposed in Joachims ( 2006 ) is implemented in the software \(\hbox {SVM}^{perf}\) that can be downloaded from http://www.cs.cornell.edu/people/tj/svm_light/svm_perf.html

Interior point Methods:

The methods described in Ferris and Munson ( 2002 ); Gertz and Griffin ( 2010 ) are part of the software for quadratic programming OOQP downloadable at http://pages.cs.wisc.edu/~swright/ooqp/

The method described in Woodsend and Gondzio ( 2011 ) is implemented in the software SVM-OOPS that can be downloaded at http://www.maths.ed.ac.uk/ERGO/svm-oops/

Decomposition methods for solving SVM in the dual:

SMO-type algorithms and general decomposition algorithms have been implemented both in the software \(\hbox {SVM}^{light}\) that can be downloaded at http://svmlight.joachims.org/ and in the software LIBSVM that can be downloaded at https://www.csie.ntu.edu.tw/~cjlin/libsvm/index.html .

An efficient library for linear classification is implemented in the software LIBLINEAR that can be downloaded at https://www.csie.ntu.edu.tw/~cjlin/liblinear/

9 Concluding remarks

In this paper we have presented an overview of the nonlinear optimization methods for SVM training, which typically involves convex programming problems, whose difficulties are related to the dimensions, i.e., to the number of training instances, and/or to the number of features. We have considered different equivalent formulations, pointing out the main theoretical and computational difficulties of the problems. We have described the most important and used optimization methods for SVM training, and we have discussed how the algorithms have been specifically designed and adapted to take into account the structure of the considered problems.

In our analysis we have limited ourselves to models and algorithms for binary classification, since SVM are by nature binary classifiers. Although the paper is a survey, in a field as vast as SVM we had to leave out several related important topics, such as multiclass classification, one-class SVM, support vector regression, semi-supervised SVM, and online incremental SVM. However, we believe that most of the concepts, models and algorithms developed for SVM binary classification may represent a sound and useful basis to analyze the other classes of SVM models.


References

Astorino, A., & Fuduli, A. (2015). Support vector machine polyhedral separability in semisupervised learning. Journal of Optimization Theory and Applications, 164, 1039–1050.


Bertsekas, D. P. (1999). Nonlinear programming . Athena Scientific.

Bishop, C. M. (2006). Pattern recognition and machine learning . Springer.

Boser, B. E., Guyon, I. M., & Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. In Proceedings of the fifth annual workshop on computational learning theory, COLT ’92 (pp. 144–152) New York, NY, USA, ACM.

Byrd, R. H., Chin, G. M., Neveitt, W., & Nocedal, J. (2011). On the use of stochastic hessian information in optimization methods for machine learning. SIAM Journal on Optimization, 21 , 977–995.

Carrizosa, E., & Romero Morales, D. (2013). Supervised classification and mathematical optimization. Computers and Operation Research, 40 , 150–165.

Cassioli, A., Chiavaioli, A., Manes, C., & Sciandrone, M. (2013). An incremental least squares algorithm for large scale linear classification. European Journal of Operational Research, 224 , 560–565.

Chang, C. C., Hsu, C. W., & Lin, C. J. (2000). The analysis of decomposition methods for support vector machines. IEEE Transactions on Neural Networks and Learning Systems, 11 , 1003–1008.

Chang, K. W., Hsieh, C. J., & Lin, C. J. (2008). Coordinate descent method for large-scale l2-loss linear support vector machines. Journal of Machine Learning Research, 9 , 1369–1398.


Chapelle, O. (2007). Training a support vector machine in the primal. Neural Computation., 19 , 1155–1178.

Chen, P. H., Fan, R. E., & Lin, C. J. (2006). A study on SMO-type decomposition methods for support vector machines. IEEE Transactions on Neural Networks, 17 , 893–908.

Chiang, W.L., Lee, M.C., & Lin, C.J. (2016) . Parallel dual coordinate descent method for large-scale linear classification in multi-core environments. In Proceedings of the 22Nd ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’16 (pp. 1485–1494) New York, NY, USA, ACM.

Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20 , 273–297.

Dai, Y. H., & Fletcher, R. (2006). New algorithms for singly linearly constrained quadratic programs subject to lower and upper bounds. Mathematical Programming, 106, 403–421.

Fan, R. E., Chang, K. W., Hsieh, C. J., Wang, X. R., & Lin, C. J. (2008). Liblinear: A library for large linear classification. Journal of Machine Learning Research, 9 , 1871–1874.

Fan, R. E., Chen, P. H., & Lin, C. J. (2005). Working set selection using second order information for training support vector machines. Journal of Machine Learning Research, 6 , 1889–1918.

Ferris, M. C., & Munson, T. S. (2004). Semismooth support vector machines. Mathematical Programming B, 101 , 185–204.

Ferris, M. C., & Munson, T. S. (2002). Interior-point methods for massive support vector machines. SIAM Journal on Optimization, 13 , 783–804.

Fine, S., & Scheinberg, K. (2001). Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research, 2 , 243–264.

Fletcher, R. (1987). Practical methods of optimization (2nd ed.). Wiley.

Franc, V., & Sonnenburg, S. (2009). Optimized cutting plane algorithm for large-scale risk minimization. Journal of Machine Learning Research, 10, 2157–2192.

Franc, V. & Sonnenburg, S. (2008) . Optimized cutting plane algorithm for support vector machines. In Proceedings of the 25th international conference on machine learning, ICML ’08 (pp. 320–327) New York, NY, USA, ACM.

Fung, G., & Mangasarian, O. L. (2001) . Proximal support vector machine classifiers. In Proceedings of the seventh ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’01 (pp. 77–86) New York, NY, USA, ACM.

Gaudioso, M., Gorgone, E., Labbé, M., & Rodríguez-Chía, A. M. (2017). Lagrangian relaxation for SVM feature selection. Computers and OR, 87 , 137–145.

Gertz, E. M., & Griffin, J. D. (2010). Using an iterative linear solver in an interior-point method for generating support vector machines. Computational Optimization and Applications, 47 , 431–453.

Glasmachers, T., & Igel, C. (2006). Maximum-gain working set selection for SVMs. Journal of Machine Learning Research, 7 , 1437–1466.

Goldfarb, D., & Scheinberg, K. (2008). Numerically stable LDLT factorizations in interior point methods for convex quadratic programming. IMA Journal of Numerical Analysis, 28 , 806–826.

Golub, G. H., & Van Loan, C. F. (1996). Matrix computations (3rd ed.). Johns Hopkins University Press.

Hsieh, C.J., Chang, K.W., Lin, C.J., Keerthi, S. S., & Sundararajan, S. (2008) . A dual coordinate descent method for large-scale linear SVM. In Proceedings of the 25th international conference on machine learning, ICML ’08 (pp. 408–415) New York.

Hsu, C. W., & Lin, C. J. (2002). A comparison of methods for multiclass support vector machines. IEEE Transanctions on Neural Networks, 13 , 415–425.

Hsu, C. W., & Lin, C. J. (2002). A simple decomposition method for support vector machines. Machine Learning, 46 , 291–314.

Teo, C. H., Vishwanathan, S. V. N., Smola, A. J., & Le, Q. V. (2010). Bundle methods for regularized risk minimization. Journal of Machine Learning Research, 11, 311–365.

Joachims, T. (1999). Making large-scale support vector machine learning practical. In Advances in kernel methods (pp. 169–184). MIT Press.

Joachims, T. (2006) . Training linear svms in linear time. In Proceedings of the 12th ACM SIGKDD International conference on knowledge discovery and data mining, KDD ’06 (pp. 217–226) New York.

Joachims, T., Finley, T., & Yu, C. N. J. (2009). Cutting-plane training of structural SVMs. Machine Learning, 77 , 27–59.

Joachims, T., & Yu, C. N. J. (2009). Sparse kernel SVMs via cutting-plane training. Machine Learning, 76 , 179–193.

Keerthi, S. S., & DeCoste, D. (2005). A modified finite Newton method for fast solution of large scale linear SVMs. Journal of Machine Learning Research, 6 , 341–361.

Keerthi, S. S., & Gilbert, E. G. (2002). Convergence of a generalized SMO algorithm for SVM classifier design. Machine Learning, 46 , 351–360.

Keerthi, S. S., & Lin, C. J. (2003). Asymptotic behaviors of support vector machines with Gaussian kernel. Neural Computation, 15 , 1667–1689.

Kimeldorf, G. S., & Wahba, G. (1970). A correspondence between Bayesian estimation on stochastic processes and smoothing by splines. The Annals of Mathematical Statistics, 41, 495–502.

Kiwiel, K. C. (2008). Breakpoint searching algorithms for the continuous quadratic knapsack problem. Mathematical Programming, 112 , 473–491.

Le, Q. V., Smola, A. J., & Vishwanathan, S. (2008). Bundle methods for machine learning. In J. C. Platt, D. Koller, Y. Singer, & S. T. Roweis (Eds.), Advances in neural information processing systems 20 (pp. 1377–1384). Curran Associates Inc.

Lee, M. C., Chiang, W. L., & Lin, C. J. (2015). Fast matrix-vector multiplications for large-scale logistic regression on shared-memory systems. In C. Aggarwal, Z.-H. Zhou, A. Tuzhilin, & W. Xindong (Eds.), ICDM (pp. 835–840). IEEE Computer Society.

Lee, Y. J., & Mangasarian, O. L. (2001). SSVM: A smooth support vector machine for classification. Computational Optimization and Applications, 20 , 5–22.

Lin, C.-J., Lucidi, S., Palagi, L., Risi, A., & Sciandrone, M. (2009). A decomposition algorithm model for singly linearly constrained problems subject to lower and upper bounds. Journal of Optimization Theory and Applications, 141 , 107–126.

Lin, C. J. (2001). Formulations of support vector machines: A note from an optimization point of view. Neural Computation, 13 , 307–317.

Lin, C. J. (2001). On the convergence of the decomposition method for support vector machines. IEEE Transactions on Neural Networks, 12 , 1288–1298.

Lin, C. J. (2002). Asymptotic convergence of an SMO algorithm without any assumptions. IEEE Transactions on Neural Networks, 13 , 248–250.

Lin, C. J. (2002). A formal analysis of stopping criteria of decomposition methods for support vector machines. IEEE Transactions on Neural Networks, 13 , 1045–1052.

Lin, C. J., & Morè, J. J. (1999). Newton’s method for large bound-constrained optimization problems. SIAM Journal on Optimization, 9 , 1100–1127.

Lucidi, S., Palagi, L., Risi, A., & Sciandrone, M. (2007). A convergent decomposition algorithm for support vector machines. Computational Optimization and Applications, 38 , 217–234.

Lucidi, S., Palagi, L., Risi, A., & Sciandrone, M. (2009). A convergent hybrid decomposition algorithm model for SVM training. IEEE Transactions on Neural Networks, 20 , 1055–1060.

Mangasarian, O. L., & Musicant, D. R. (2001). Lagrangian support vector machines. Journal of Machine Learning Research, 1, 161–177.

Mangasarian, O. J. (1994). Nonlinear programming . Society for Industrial and Applied Mathematics.

Mangasarian, O. L. (2002). A finite Newton method for classification. Optimization Methods and Software, 17 , 913–929.

Mangasarian, O. L. (2006). Exact 1-norm support vector machines via unconstrained convex differentiable minimization. Journal of Machine Learning Research, 7 , 1517–1530.

Mangasarian, O. L., & Musicant, D. R. (1999). Successive overrelaxation for support vector machines. IEEE Transactions on Neural Networks, 10 , 1032–1037.

Mavroforakis, M. E., & Theodoridis, S. (2006). A geometric approach to support vector machine (SVM) classification. IEEE Transactions on Neural Networks, 17 , 671–682.

Osuna, E., Freund, R., & Girosi, F. (1997) . An improved training algorithm for support vector machines. In Neural networks for signal processing VII. Proceedings of the 1997 IEEE signal processing society workshop (pp. 276–285). IEEE.

Osuna, E., Freund, R., & Girosit, F. (1997) Training support vector machines: an application to face detection. In Proceedings of IEEE computer society conference on computer vision and pattern recognition (pp. 130–136).

Palagi, L., & Sciandrone, M. (2005). On the convergence of a modified version of the SVM light algorithm. Optimization Methods and Software, 20 , 315–332.

Piccialli, V., & Sciandrone, M. (2018). Nonlinear optimization and support vector machines. 4OR, 16 , 111–149.

Pardalos, P. M., & Kovoor, N. (1990). An algorithm for a singly constrained class of quadratic programs subject to upper and lower bounds. Mathematical Programming, 46 , 321–328.

Platt, J. C. (1999) . Advances in kernel methods. chapter fast training of support vector machines using sequential minimal optimization (pp. 185–208). MIT Press.

Scholkopf, B., & Smola, A. J. (2001). Learning with kernels: Support vector machines, regularization, optimization, and beyond . MIT Press.

Shalev-Shwartz, S., Singer, Y., Srebro, N., & Cotter, A. (2011). Pegasos: Primal estimated sub-gradient solver for SVM. Mathematical Programming, 127 , 3–30.

Shawe-Taylor, J., & Cristianini, N. (2004). Kernel methods for pattern analysis . Cambridge University Press.

Suykens, J. A. K., & Vandewalle, J. (1999). Least squares support vector machine classifiers. Neural Processing Letters, 9 , 293–300.

Glasmachers, T., & Dogan, Ürün (2013) . Accelerated coordinate descent with adaptive coordinate frequencies. In Asian Conference on machine learning, ACML 2013, Canberra, ACT, Australia, November 13-15, 2013 (pp. 72–86)

Serafini, T., & Zanni, L. (2005). On the working set selection in gradient projection-based decomposition techniques for support vector machines. Optimization Methods and Software, 20 , 583–596.

Tsang, I. W., Kwok, J. T., & Cheung, P. M. (2005). Core vector machines: Fast SVM training on very large data sets. Journal of Machine Learning Research, 6 , 363–392.

Vapnik, V. N. (1998). Statistical learning theory . Wiley-Interscience.

Wang, P. W., & Lin, C. J. (2014). Iteration complexity of feasible descent methods for convex optimization. Journal of Machine Learning Research, 15 , 1523–1548.

Wang, Z., Crammer, K., & Vucetic, S. (2012). Breaking the curse of kernelization: Budgeted stochastic gradient descent for large-scale SVM training. Journal of Machine Learning Research, 13 , 3103–3131.

Woodsend, K., & Gondzio, J. (2009). Hybrid MPI/openMP parallel linear support vector machine training. Journal of Machine Learning Research, 10 , 1937–1953.

Woodsend, K., & Gondzio, J. (2011). Exploiting separability in large-scale linear support vector machine training. Computational Optimization and Applications, 49 , 241–269.


Appendix A: Proof of existence and uniqueness of the optimal hyperplane

The idea underlying the proof of existence and uniqueness of the optimal hyperplane is based on the following steps:

for each separating hyperplane \(H(w,b)\), there exists a separating hyperplane \(H({\hat{w}},{\hat{b}})\) such that

the above condition implies that problem (2), i.e.,

admits a solution provided that the following problem

admits a solution;

problem (54) is obviously equivalent to

then we prove that (55) admits a unique solution, which is also the unique solution of (2).

Let \(H({\hat{w}},{\hat{b}})\) be a separating hyperplane. Then

\(\square \)

Given any separating hyperplane \(H({\hat{w}},{\hat{b}})\), there exists a separating hyperplane \(H({\bar{w}},{\bar{b}})\) such that

Moreover, there exist two points \(x^+\in A\) and \(x^-\in B\) such that

Let \({\hat{x}}^i\in A\) and \({\hat{x}}^j\in B\) be the closest points to \(H({\hat{w}},{\hat{b}})\), that is, the two points such that

from which it follows

Let us consider the numbers \(\alpha \) and \(\beta \) such that

that is, the numbers

It can be easily verified that \(0<\alpha \le 1\). We will show that the hyperplane \(H({\bar{w}},{\bar{b}}) \equiv H(\alpha {\hat{w}},\beta )\) is a separating hyperplane for the sets A and B, and that (56) holds. Indeed, using (58), we have

As \(\alpha >0\), we can write

from which we get that \({\bar{w}}\) and \({\bar{b}}\) satisfy (1), and hence that \(H({\bar{w}},{\bar{b}})\) is a separating hyperplane for the sets A and B.

Furthermore, taking into account (61) and the value of \(\alpha \), we have

Condition (56) follows from the above equality and (59). Using (60) we obtain that (57) holds with \(x^+={\hat{x}}^i\) and \(x^-={\hat{x}}^j\). \(\square \)

Proposition 4

The following problem

admits a unique solution \((w^\star ,b^\star )\).

Let \({{\mathcal {F}}}\) be the feasible set, that is,

Given any \((w_o,b_o)\in {{\mathcal {F}}}\), let us consider the level set

The set \({{\mathcal {L}}}_o\) is closed, and we will show that it is also bounded. To this end, assume by contradiction that there exists an unbounded sequence \(\{(w_k,b_k)\}\) belonging to \({{\mathcal {L}}}_o\). Since \(\Vert w_k\Vert \le \Vert w_o\Vert ,\forall k\), we must have \(|b_k|\rightarrow \infty \). For any k we can write

and hence, as \(|b_k|\rightarrow \infty \), for k sufficiently large we have \(\Vert w_k\Vert ^2 >\Vert w_o\Vert ^2\), and this contradicts the fact that \(\{(w_k,b_k)\}\) belongs to \({{\mathcal {L}}}_o\). Thus \({{\mathcal {L}}}_o\) is a compact set.

Weierstrass's theorem implies that the function \(\Vert w\Vert ^2\) admits a minimum \((w^\star ,b^\star )\) on \({{\mathcal {L}}}_o\), and hence on \({{\mathcal {F}}}\). As a consequence, \((w^\star ,b^\star )\) is a solution of (62).

In order to prove that \((w^\star ,b^\star )\) is the unique solution, assume by contradiction that there exists a pair \(({\bar{w}},{\bar{b}})\in {{\mathcal {F}}}\), \(({\bar{w}},{\bar{b}})\ne (w^\star ,b^\star )\), such that \(\Vert {\bar{w}}\Vert ^2=\Vert w^\star \Vert ^2\). Suppose \({\bar{w}}\ne w^\star \). The set \({{\mathcal {F}}}\) is convex, so that

Since \(\Vert w\Vert ^2\) is a strictly convex function, for any \(\lambda \in (0,1)\) it follows that

Taking \(\lambda =1/2\), which corresponds to the pair \(({\tilde{w}}, \tilde{b})\equiv \displaystyle {\left( {1\over 2}w^\star +{1\over 2}\bar{w},{1\over 2} b^\star +{1\over 2}{\bar{b}}\right) }\), we have \((\tilde{w},{\tilde{b}})\in {{\mathcal {F}}}\) and

and this contradicts the fact that \((w^\star ,b^\star )\) is a global minimum. Therefore, we must have \({\bar{w}}\equiv w^\star \).

Assume \(b^\star >{\bar{b}}\) (the case \(b^\star <{\bar{b}}\) is analogous), and consider the point \({\hat{x}}^i\in A\) such that

(the existence of such a point follows from (57) of Lemma 2). We have

and this contradicts the fact that \({\bar{w}}^Tx^i+\bar{b}\ge 1,~\forall x^i\in A\). As a consequence, we must have \({\bar{b}}\equiv b^\star \), and hence the uniqueness of the solution is proved. \(\square \)

Proposition 5

Let \((w^\star ,b^\star )\) be the solution of (62). Then \((w^\star ,b^\star )\) is the unique solution of the following problem

We observe that \((w^\star ,b^\star )\) is the unique solution of the problem

Lemma 1 and Lemma 2 imply that, for any separating hyperplane \(H(w,b)\), we have

and hence, for the separating hyperplane \(H(w^\star ,b^\star )\) we obtain \(\rho (w^\star , b^\star )=\displaystyle {1\over {\Vert w^\star \Vert }}\), which implies that \(H(w^\star ,b^\star )\) is the optimal separating hyperplane. \(\square \)
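
To make this concrete, here is a minimal numerical sketch of problem (62): minimize \(\Vert w\Vert ^2\) subject to the separation constraints \(y_i(w^Tx^i+b)\ge 1\) on a toy two-dimensional dataset. The dataset and the choice of scipy's SLSQP solver are illustrative assumptions on my part, not part of the paper; any QP solver would do.

```python
# Toy check of Propositions 4-5: min ||w||^2  s.t.  y_i (w^T x_i + b) >= 1
# on a small linearly separable dataset (hypothetical data, for illustration only).
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 1.0],       # class A, label +1
              [-1.0, -1.0], [-2.0, 0.5], [0.0, -2.0]])  # class B, label -1
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])

def objective(v):                 # v = (w1, w2, b); the bias b is not penalized
    w = v[:2]
    return w @ w                  # ||w||^2

def separation(v):                # each component must be >= 0
    w, b = v[:2], v[2]
    return y * (X @ w + b) - 1.0

res = minimize(objective, x0=np.array([1.0, 1.0, 0.0]), method="SLSQP",
               constraints=[{"type": "ineq", "fun": separation}])
w_star, b_star = res.x[:2], res.x[2]
print("w* =", w_star, "b* =", b_star)
print("geometric margin =", 1.0 / np.linalg.norm(w_star))
```

The solver returns a single minimizer regardless of the starting point, which is what the uniqueness argument above guarantees.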

Appendix B: The Wolfe dual and its properties

Consider the convex problem

$ \displaystyle \min_{x \in \Re^n} \; f(x) \quad \textup{s.t.} \quad g(x) \le 0, \;\; h(x) = 0 \qquad (64)$

with \(f:\Re ^n\rightarrow \Re \) convex and continuously differentiable, \(g:\Re ^n\rightarrow \Re ^m\) convex and continuously differentiable, and \(h:\Re ^n\rightarrow \Re ^p\) affine. Then its Wolfe dual is

$ \displaystyle \max_{x,\lambda ,\mu } \; L(x,\lambda ,\mu ) \quad \textup{s.t.} \quad \nabla_x L(x,\lambda ,\mu ) = 0, \;\; \lambda \ge 0 \qquad (65)$

where \(L(x,\lambda ,\mu ) = f(x)+\lambda ^Tg(x)+\mu ^Th(x)\).
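
As a quick sanity check of this construction, the sketch below sets up the Wolfe dual of a tiny convex program and verifies numerically that the primal and dual optimal values coincide (the zero-duality-gap property proved below). The example problem \(\min x_1^2+x_2^2\) s.t. \(x_1+x_2\ge 1\) and the use of scipy are my own illustrative choices, not taken from the paper.

```python
# Illustrative sketch (not from the paper): Wolfe dual of
#   min  x1^2 + x2^2   s.t.  1 - x1 - x2 <= 0,
# i.e.  max_{x,lam} L(x,lam)  s.t.  grad_x L(x,lam) = 0,  lam >= 0,
# with  L(x,lam) = x1^2 + x2^2 + lam*(1 - x1 - x2).
import numpy as np
from scipy.optimize import minimize

def f(x):
    return x[0] ** 2 + x[1] ** 2

def L(v):                                   # v = (x1, x2, lam)
    x, lam = v[:2], v[2]
    return f(x) + lam * (1.0 - x[0] - x[1])

def grad_x_L(v):                            # stationarity constraint: grad_x L = 0
    x, lam = v[:2], v[2]
    return np.array([2.0 * x[0] - lam, 2.0 * x[1] - lam])

primal = minimize(f, x0=[1.0, 1.0], method="SLSQP",
                  constraints=[{"type": "ineq",
                                "fun": lambda x: x[0] + x[1] - 1.0}])

dual = minimize(lambda v: -L(v), x0=[0.0, 0.0, 1.0], method="SLSQP",
                bounds=[(None, None), (None, None), (0.0, None)],
                constraints=[{"type": "eq", "fun": grad_x_L}])

print("primal optimal value:", primal.fun)   # expected 0.5 at x = (0.5, 0.5)
print("Wolfe dual value    :", -dual.fun)    # expected 0.5 -> zero duality gap
```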

Proposition 6

Let \(x^*\) be a global solution of problem (64) with multipliers \((\lambda ^*,\mu ^*)\). Then it is also a solution of problem (65) and there is zero duality gap, i.e., \(f(x^*) = L(x^*,\lambda ^*,\mu ^*)\).

The point \((x^*,\lambda ^*,\mu ^*)\) is clearly feasible for problem (65), since it satisfies the KKT conditions of problem (64). Furthermore, by complementarity (\((\lambda ^*)^Tg(x^*)=0\)) and feasibility (\(h(x^*)=0\))

so that there is zero duality gap. Furthermore, for any \(\lambda \ge 0\), \(\mu \in \Re ^p\), by the feasibility of \(x^*\) we have

By the convexity assumptions on f and g, the nonnegativity of \(\lambda \), and the linearity of h, we get that \(L(\cdot ,\lambda ,\mu )\) is a convex function of x; hence, for any feasible \((x,\lambda ,\mu )\), we can write

where the last equality derives from the constraints of problem (65). By combining (66) and (67), we get

and hence \((x^*,\lambda ^*,\mu ^*)\) is a global solution of problem (65). \(\square \)

A stronger result can be proved when the primal problem is a convex quadratic programming problem defined by (6).

Proposition 7

First, we show that in this case problem (7) is a convex quadratic programming problem. In particular, in the quadratic case problem (7) becomes:

Multiplying the constraints (68) by \(x^T\), we get

which implies that the objective function of (68) can be rewritten as

which shows that problem (68) is indeed a convex quadratic optimization problem. For this problem the KKT conditions are necessary and sufficient for global optimality; if we denote by v the multipliers of the equality constraints (69) and by z the multipliers of the constraints (70), then there must exist multipliers v and z such that the following conditions hold:

The expression of z can be derived from constraints (72) and substituted into (73) and (74), implying:

Furthermore, by subtracting (71) from (75), we get

By combining (76), (77), (78) and (79), we get that the pair \((v,{\bar{\lambda }})\) satisfies the KKT conditions of problem (6); hence, setting \(x^*=v\), we obtain the claim, taking into account that point (i) follows from (71). \(\square \)
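
For the SVM itself this machinery leads to the familiar dual quadratic program in the multipliers. As an illustration (the toy data, the \(\frac{1}{2}\Vert w\Vert ^2\) scaling and the scipy solver are my own choices, not prescribed by the paper), the sketch below solves the hard-margin dual \(\max _\alpha \sum _i\alpha _i-\frac{1}{2}\alpha ^TQ\alpha \) with \(Q_{ij}=y_iy_j\,x_i^Tx_j\), subject to \(\sum _i\alpha _iy_i=0\) and \(\alpha \ge 0\), recovers the primal solution from the dual multipliers, and checks the zero duality gap.

```python
# Illustrative sketch (toy data and solver choice are assumptions, not from the
# paper): solve the hard-margin SVM dual and verify the zero duality gap against
# the primal value (1/2)||w*||^2.
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 1.0],
              [-1.0, -1.0], [-2.0, 0.5], [0.0, -2.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
m = len(y)
Q = (y[:, None] * X) @ (y[:, None] * X).T          # Q_ij = y_i y_j <x_i, x_j>

def neg_dual(alpha):                               # maximize by minimizing the negative
    return 0.5 * alpha @ Q @ alpha - alpha.sum()

res = minimize(neg_dual, x0=np.ones(m), method="SLSQP",
               bounds=[(0.0, None)] * m,
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])
alpha = res.x
w = (alpha * y) @ X                                # w* = sum_i alpha_i y_i x_i
sv = int(np.argmax(alpha))                         # index of a support vector
b = y[sv] - X[sv] @ w                              # from the active constraint

print("dual optimal value   :", -res.fun)
print("primal value 0.5|w|^2:", 0.5 * w @ w)       # equal up to solver tolerance
print("w* =", w, " b* =", b)
```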

Appendix C: Kernel characterization

Proposition 8

Let \(K:X\times X\rightarrow \Re \) be a symmetric function. The function K is a kernel if and only if the \(l\times l\) matrix with entries \(K(x^i,x^j)\), \(i,j=1,\ldots ,l\), is positive semidefinite for any set of training vectors \(\{x^1,\ldots ,x^l\}\).

Necessity. Symmetry derives from the symmetry of the function K. To prove positive semidefiniteness, we look at the quadratic form, for any \(v\in \Re ^l\):

Sufficiency. Assume

We need to prove that there exist a linear space \({{\mathcal {H}}}\), a function \(\phi :X\rightarrow {{\mathcal {H}}}\) and a scalar product \(\langle \cdot ,\cdot \rangle \) defined on \({{\mathcal {H}}}\) such that \(K(x,y) = \langle \phi (x),\phi (y)\rangle \) for all \(x,y\in X\). Consider the linear space

with generic element \(f(\cdot ) = \sum _{i=1}^{m}\alpha _iK(\cdot ,x^i)\)

for any \(m\in N\), with \(\alpha _i\in \Re \) for \(i=1,\ldots ,m\). Given two elements \(f,g\in {{\mathcal {H}}}\), with \(g(\cdot ) = \sum _{j=1}^{m^\prime }\beta _jK(\cdot ,x^j)\), define the function \(\rho :{{\mathcal {H}}}\times {{\mathcal {H}}}\rightarrow \Re \) as \(\rho (f,g) = \sum _{i=1}^{m}\sum _{j=1}^{m^\prime }\alpha _i\beta _jK(x^i,x^j)\).

It can be shown that the function \(\rho \) is a scalar product in the space \({{\mathcal {H}}}\), by showing that the following properties hold:

(i) \(\rho (f,g) = \rho (g,f)\)

(ii) \(\rho (f^1+f^2,g) = \rho (f^1,g)+\rho (f^2,g)\)

(iii) \(\rho (\lambda f,g) = \lambda \rho (f,g) \)

(iv) \(\rho (f,f)\ge 0\) and \(\rho (f,f)=0\) implies \(f=0\)

The first three properties are a consequence of the definition of \(\rho \) and can be easily verified. We need to show property (iv). First, we observe that, given \(f^1,\ldots ,f^p\) in \({{\mathcal {H}}}\), the matrix with elements \(\rho _{st} = \rho (f^s,f^t)\) is symmetric (thanks to property (i)) and positive semidefinite. Indeed,

This implies in turn that all principal minors have nonnegative determinant. Consider any \(2\times 2\) principal minor, with elements \(\rho _{ij} = \rho (f^i,f^j)\). The nonnegativity of the determinant and the symmetry of the matrix imply

We note that, setting \(m^\prime =1\) and \(g(\cdot ) = K(\cdot ,x)\), \(f(x)\) can be written as

with \(K(\cdot ,x)\in {{\mathcal {H}}}\). Furthermore, for any \(x,y \in X\), we get

Using (81) with \(f^i = K(\cdot ,x)\) and \(f^j = f\), we get

which implies, thanks to (80), both that \(\rho (f,f)\ge 0\) and that, if \(\rho (f,f)=0\), then \(f(x)^2\le 0\) for all \(x\in X\), and hence \(f=0\).
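
A quick numerical illustration of the necessity direction: for a valid kernel, every Gram matrix built from a set of training vectors must be symmetric and positive semidefinite. The Gaussian kernel and the random points below are hypothetical choices made for illustration.

```python
# Sketch: check that a Gaussian-kernel Gram matrix K_ij = K(x^i, x^j) is
# symmetric and positive semidefinite on a random set of training vectors
# (kernel choice and data are illustrative assumptions).
import numpy as np

def gaussian_kernel(x, z, gamma=0.5):
    return np.exp(-gamma * np.sum((x - z) ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                       # 20 training vectors in R^3

K = np.array([[gaussian_kernel(xi, xj) for xj in X] for xi in X])
eigvals = np.linalg.eigvalsh(K)                    # eigvalsh assumes symmetry

print("symmetric     :", bool(np.allclose(K, K.T)))
print("min eigenvalue:", eigvals.min())            # >= 0 up to round-off

v = rng.normal(size=20)
print("v^T K v =", v @ K @ v)                      # nonnegative, matching the quadratic-form test
```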





