Support Vector Machines
Essay by Nicolas • July 11, 2011 • Case Study
Support Vector Machines (SVM)
Support vector machines (Vapnik, 1998; Cristianini and Shawe-Taylor, 2000) are a kind of supervised machine learning algorithm: an algorithm that learns from labeled examples (we give the algorithm training data so that it learns under our supervision). They can therefore be used to solve problems for which we do not know a direct computational route to the solution.
Example: for physical problems we know the underlying mathematical law, so we can predict their evolution from some initial conditions; but we do not know enough about the relation between a cell and its gene expression pattern.
In this kind of situation we can use machine learning, training the algorithm to learn the characteristics of the system (e.g. gene expression patterns).
Why SVM instead of other supervised learning methods such as perceptrons?
Because SVMs are proven to find the optimal solution for a given set of training data.
This is because the objective function is convex with a unique global optimum, which avoids the local minima that are one of the pitfalls of perceptrons. The optimal solution is always found because the method is based on Lagrangian variational calculus: it is an optimization problem subject to a set of constraints.
What does it really do?
It searches the input space (for instance the gene expression space where the samples are located) for a hyperplane that is as far as possible from the samples of each of the classes to be learned.
This is done by finding the hyperplane that maximizes a quantity called the margin: the distance from the hyperplane to the closest points of each class.
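In symbols (a standard textbook formulation, not spelled out in the original text): with class labels y_i in {-1, +1} and a hyperplane w·x + b = 0, maximizing the margin amounts to

```latex
\min_{w,b} \; \tfrac{1}{2}\,\lVert w \rVert^2
\quad \text{subject to} \quad
y_i \,(w \cdot x_i + b) \ge 1 \;\; \text{for all } i ,
```

and the resulting margin is 2/‖w‖. This is the convex constrained optimization problem, solvable with Lagrange multipliers, referred to above.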
In the case of gene expression data the input space has several thousands of dimensions, and each sample is a point with a label (e.g. cancer and non-cancer). The support vector machine will then find the best hyperplane to separate the two classes.
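To make the notion of margin concrete, here is a minimal pure-Python sketch (the 2-D data points, labels, and candidate hyperplanes are made up for illustration): it computes the geometric margin of a given hyperplane w·x + b = 0 over a small labeled sample set. The maximum-margin hyperplane the SVM finds is the one for which this value is largest.

```python
import math

def margin(w, b, points):
    """Geometric margin of the hyperplane w.x + b = 0:
    the distance from the hyperplane to the closest sample."""
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    distances = [abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm_w
                 for x, _label in points]
    return min(distances)

# Toy 2-D "expression" samples with class labels (made-up numbers).
samples = [((1.0, 2.0), "cancer"), ((2.0, 3.0), "cancer"),
           ((4.0, 0.0), "non_cancer"), ((5.0, 1.0), "non_cancer")]

# Two candidate separating hyperplanes; an SVM would prefer the
# first one, because its margin is larger.
print(margin((1.0, -1.0), -1.0, samples))  # the larger margin
print(margin((1.0, 0.0), -3.0, samples))   # the smaller margin
```

Both hyperplanes separate the two classes, but only the first maximizes the distance to the closest points, which is exactly the criterion described above.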
Multiclass classification can be performed by means of several binary classifications; see Chang and Lin, 2002.
An additional problem with this kind of data is that, because it is difficult to obtain enough samples, we have very few points in a space with a great many dimensions, so the input space is nearly empty. When this happens it is quite probable to find a separating hyperplane even when there is no underlying class structure. That is one of the reasons why perceptrons are said to generalize poorly when the data set has more dimensions than points.
SVMs overcome this to some extent because they always find the optimal (maximum-margin) hyperplane.
Non-linear classification
So far we have talked only about separating classes with hyperplanes. That is a linear separation, and in real life it is not difficult to find problems that are not linearly separable. How can we cope with this?
We can use a mathematical trick that consists of stretching and contorting the input space into a new space (possibly with more dimensions than the input space) called the "feature space", in which our data will probably be separable.
This operation is done implicitly by means of a kernel function (the reason for doing it implicitly is that an explicit mapping would take too long, making it computationally infeasible).
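A minimal sketch of this "kernel trick" (the function names and numbers are ours, for illustration only): for the degree-2 polynomial kernel k(x, z) = (x·z)², the kernel value computed in the input space equals the ordinary dot product after explicitly mapping 2-D points into a 3-D feature space. An SVM only ever needs the kernel values, so it can work in the feature space without ever constructing it.

```python
import math

def poly_kernel(x, z):
    """Degree-2 polynomial kernel: k(x, z) = (x . z)^2,
    computed entirely in the input space."""
    return sum(xi * zi for xi, zi in zip(x, z)) ** 2

def feature_map(x):
    """The explicit 3-D feature space this kernel corresponds to:
    (x1^2, x2^2, sqrt(2) * x1 * x2)."""
    x1, x2 = x
    return (x1 * x1, x2 * x2, math.sqrt(2) * x1 * x2)

x, z = (1.0, 2.0), (3.0, 0.5)
implicit = poly_kernel(x, z)   # kernel value in the input space
explicit = sum(a * b for a, b in zip(feature_map(x), feature_map(z)))
print(implicit, explicit)      # the two agree (up to floating point)
```

For kernels like the Gaussian (RBF) kernel the corresponding feature space is infinite-dimensional, so the implicit computation is not just faster, it is the only option.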
Increasing the number of dimensions is useful for classifying non-separable data, but it also increases the risk of overfitting the data, which means poor generalisation. The right approach is to start with the simplest kernel possible.
A more detailed introductory review of SVMs can be found in Burges, 1998, and some applications of SVMs to microarray data are Brown et al., 2000, and Furey et al., 2000.
How to use this SVM web interface?
At this SVM site you can find two programs:
SVM train
SVM classify
SVM train allows you to build a model based on your data and save it on your local computer for later classification of new data; at the same time it will perform a cross-validation on the training data.
SVM classify is used to classify new samples using a model previously built with SVM train.
Train program
Data files and format
Two files are needed:
1. One file with the gene expression data, in which each row is a gene and each column is a sample, e.g.:

#NAMES g1 g2 g1 g1 g2
gene1 23.4 45.6 2.0 76 85.6
#comment line
genW@ 0.1 34 23 1.3 13
genX# 23 25.6 29.4 13.2 0.4

Comment lines are allowed: they start with a '#' and are ignored.
The #NAMES line is mandatory; all lines before it are ignored (even if they do not start with '#').
Example of data file (figure 1 of Alizadeh et al., 2000, with all the identifiers for each clone collapsed into one; strange characters like ')', ',', etc. have been replaced with underscores to avoid problems with the Newick format used by other tools on this server).

2. One file with the labels of the classes the samples belong to. The format is one line with, separated by tabs, as many columns as samples, e.g.:

class1 class1 class2 class1 class2
Example of a labels file for the data from Alizadeh et al., 2000.
The data format is quite similar to that used by the SOTA and Pomelo tools on this same server; the only differences are:
1. Missing values are not allowed here.
2. You need a name for each of the samples (a names line is mandatory).
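The parsing rules above can be sketched in a few lines of stdlib-only Python (the function names and the in-line example data are ours, not part of the server's code): everything before the #NAMES line is skipped, '#' comment lines are ignored, and the labels file is a single tab-separated line.

```python
def read_expression(lines):
    """Parse gene expression lines: skip everything before #NAMES,
    then ignore '#' comment lines; each remaining row is a gene
    name followed by one value per sample."""
    names, rows = None, []
    for line in lines:
        line = line.rstrip("\n")
        if names is None:
            if line.startswith("#NAMES"):
                names = line.split()[1:]
            continue                      # lines before #NAMES are ignored
        if not line or line.startswith("#"):
            continue                      # comment lines are ignored
        fields = line.split()
        rows.append((fields[0], [float(v) for v in fields[1:]]))
    return names, rows

def read_labels(line):
    """Labels file: one line, tab-separated, one label per sample."""
    return line.rstrip("\n").split("\t")

data = ["junk header", "#NAMES g1 g2 g1 g1 g2",
        "gene1 23.4 45.6 2.0 76 85.6", "#comment line",
        "genW@ 0.1 34 23 1.3 13"]
names, rows = read_expression(data)
labels = read_labels("class1\tclass1\tclass2\tclass1\tclass2")
```

Note that, as point 1 above requires, this sketch would fail with an error on a missing value rather than accept it.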
...
...