
 Description:
Overview
The SHOGUN machine learning toolbox focuses on large-scale kernel methods, especially Support Vector Machines (SVMs). It comes with a generic interface for kernel machines and features 15 different SVM implementations that all access features in a unified way: via a general kernel framework or, in the case of linear SVMs, via so-called "DotFeatures", i.e., features providing a minimalistic set of operations (like the dot product).
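The idea behind the "DotFeatures" abstraction can be illustrated with a minimal sketch in plain Python. The class and function names below (`SparseDotFeatures`, `dense_dot`, `add_to_dense_vec`, `train_perceptron`) are assumptions made for illustration, not SHOGUN's actual API. The point is only that a linear learner needs nothing from its features beyond a dot product against a dense weight vector and a scaled add into it.

```python
# Illustrative sketch only: all names are hypothetical, not SHOGUN's API.

class SparseDotFeatures:
    """Sparse examples stored as {index: value} dicts."""

    def __init__(self, examples, dim):
        self.examples = examples
        self.dim = dim

    def __len__(self):
        return len(self.examples)

    def dense_dot(self, i, w):
        """Dot product of example i with the dense vector w."""
        return sum(v * w[j] for j, v in self.examples[i].items())

    def add_to_dense_vec(self, alpha, i, w):
        """In place: w += alpha * example i."""
        for j, v in self.examples[i].items():
            w[j] += alpha * v


def train_perceptron(feats, labels, epochs=10):
    """Perceptron that touches the data only through the two operations above."""
    w = [0.0] * feats.dim
    for _ in range(epochs):
        mistakes = 0
        for i in range(len(feats)):
            if labels[i] * feats.dense_dot(i, w) <= 0:
                feats.add_to_dense_vec(labels[i], i, w)
                mistakes += 1
        if mistakes == 0:
            break
    return w


feats = SparseDotFeatures(
    [{0: 1.0}, {0: 2.0, 2: 1.0}, {1: 1.0}, {1: 3.0, 2: 1.0}], dim=3)
labels = [1, 1, -1, -1]
w = train_perceptron(feats, labels)
predictions = [1 if feats.dense_dot(i, w) > 0 else -1 for i in range(len(feats))]
```

Because the learner never inspects the storage format, swapping the dict-based examples for dense lists or hashed string features would only require reimplementing the two methods.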
Features
SHOGUN includes the LinAdd accelerations for string kernels and the COFFIN framework for on-demand computing of features for the contained linear SVMs. In addition, it contains more advanced Multiple Kernel Learning, Multi-Task Learning and Structured Output learning algorithms and other linear methods. SHOGUN digests input feature objects of basically any known type, e.g., dense, sparse or variable-length features (strings) of any type char/byte/word/int/long int/float/double/long double.
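The idea behind the LinAdd speed-up can be sketched in a few lines of plain Python, here for a spectrum kernel (a string kernel that counts shared k-mers). This is a simplified illustration under the assumption that the kernel has an explicit, sparse feature map; the function names and toy data are hypothetical and not taken from SHOGUN. Instead of evaluating the kernel against every support vector, the weighted feature vectors are added once into `w`, after which each prediction costs a single pass over the test string.

```python
from collections import Counter

def spectrum(s, k):
    """Explicit feature map of the spectrum kernel: k-mer -> count."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))

def spectrum_kernel(a, b, k):
    """k(a, b) = <phi(a), phi(b)>, computed from the two k-mer count tables."""
    pa, pb = spectrum(a, k), spectrum(b, k)
    return sum(c * pb[kmer] for kmer, c in pa.items())

def score_with_kernel(x, support, alphas, k):
    """Standard kernel expansion: one kernel evaluation per support vector."""
    return sum(a * spectrum_kernel(sv, x, k) for sv, a in zip(support, alphas))

def build_w(support, alphas, k):
    """LinAdd-style precomputation: w = sum_i alpha_i * phi(x_i)."""
    w = Counter()
    for sv, a in zip(support, alphas):
        for kmer, c in spectrum(sv, k).items():
            w[kmer] += a * c
    return w

def score_with_w(x, w, k):
    """Prediction as <w, phi(x)>: one lookup per k-mer of x."""
    return sum(w.get(x[i:i + k], 0.0) for i in range(len(x) - k + 1))

support = ["ACGTACGT", "TTGACGTA", "GGGCCCAA"]  # toy support vectors
alphas = [0.5, -1.0, 0.25]                       # toy signed coefficients
k = 3
w = build_w(support, alphas, k)
x = "ACGTTGAC"
slow = score_with_kernel(x, support, alphas, k)
fast = score_with_w(x, w, k)
```

Both scores agree, but the second no longer depends on the number of support vectors at prediction time.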
The toolbox provides efficient implementations of 35 different kernels, among them the
 Linear,
 Polynomial,
 Gaussian and
 Sigmoid Kernel
and also provides a number of recent string kernels like the
 Locality Improved,
 Fisher,
 TOP,
 Spectrum,
 Weighted Degree Kernel (with shifts).
For the latter, the efficient LINADD optimizations are implemented. SHOGUN also offers the freedom of working with custom precomputed kernels. One of its key features is the combined kernel, which can be constructed as a weighted linear combination of a number of subkernels, each of which need not work on the same domain. An optimal subkernel weighting can be learned using Multiple Kernel Learning. Currently, SVM one-class, 2-class and multiclass classification and regression problems are supported. SHOGUN also implements a number of linear methods like
 Linear Discriminant Analysis (LDA),
 Linear Programming Machine (LPM),
 Perceptrons
and features algorithms to train Hidden Markov Models.
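The combined kernel described above, a weighted linear combination of subkernels that may operate on different domains, can be sketched as follows. This is plain Python with hypothetical function names and hand-picked weights, not SHOGUN's API; in Multiple Kernel Learning the weights would be learned rather than fixed.

```python
import math
from collections import Counter

def gaussian_kernel(a, b, width=2.0):
    """Gaussian kernel on real vectors: exp(-||a - b||^2 / width)."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-sq_dist / width)

def spectrum_kernel(s, t, k=2):
    """Spectrum kernel on strings: dot product of k-mer counts."""
    cs = Counter(s[i:i + k] for i in range(len(s) - k + 1))
    ct = Counter(t[i:i + k] for i in range(len(t) - k + 1))
    return sum(c * ct[kmer] for kmer, c in cs.items())

def combined_kernel(x, y, subkernels, weights):
    """K(x, y) = sum_j beta_j * K_j(x_j, y_j).

    Each example is a tuple holding one view per subkernel.
    """
    return sum(beta * kern(xv, yv)
               for beta, kern, xv, yv in zip(weights, subkernels, x, y))

# Each toy example has a numeric view and a sequence view.
ex1 = ([0.0, 1.0], "ACGT")
ex2 = ([0.0, 1.0], "CGTA")
weights = [0.7, 0.3]  # fixed here; MKL would learn these
value = combined_kernel(ex1, ex2, [gaussian_kernel, spectrum_kernel], weights)
```

Since a non-negative weighted sum of valid kernels is again a valid kernel, the result can be handed to any kernel machine unchanged.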
The input feature objects can be read from plain ASCII files (tab-separated values for dense matrices; libsvm/svmlight format for sparse matrices), an efficient native binary format and, with general support, an HDF5-based format, supporting
 dense
 sparse or
 strings of various types
that can often be converted between each other. Chains of preprocessors (e.g., subtracting the mean) can be attached to each feature object, allowing for on-the-fly preprocessing.
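A chain of preprocessors applied on the fly might look like the following sketch. It is plain Python with hypothetical class names (`DenseFeatures`, `SubtractMean`, `NormalizeLength`) chosen for illustration, not SHOGUN's API: the stored data is never modified, and each preprocessor is applied in order whenever a feature vector is requested.

```python
import math

class SubtractMean:
    """Preprocessor that centers vectors using a mean estimated from data."""
    def fit(self, vectors):
        n = len(vectors)
        self.mean = [sum(col) / n for col in zip(*vectors)]
        return self

    def apply(self, vec):
        return [v - m for v, m in zip(vec, self.mean)]

class NormalizeLength:
    """Preprocessor that scales a vector to unit Euclidean norm."""
    def apply(self, vec):
        norm = math.sqrt(sum(v * v for v in vec))
        return vec if norm == 0 else [v / norm for v in vec]

class DenseFeatures:
    """Feature object holding raw vectors plus a chain of preprocessors."""
    def __init__(self, vectors):
        self.vectors = vectors
        self.preprocessors = []

    def add_preprocessor(self, p):
        self.preprocessors.append(p)

    def get_feature_vector(self, i):
        # Apply the chain lazily; the raw data stays untouched.
        vec = self.vectors[i]
        for p in self.preprocessors:
            vec = p.apply(vec)
        return vec

raw = [[1.0, 2.0], [3.0, 6.0]]
feats = DenseFeatures(raw)
feats.add_preprocessor(SubtractMean().fit(raw))
feats.add_preprocessor(NormalizeLength())
first = feats.get_feature_vector(0)
```

Applying the chain at access time trades some repeated computation for not having to keep a second, preprocessed copy of the data in memory.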
Structure and Interfaces
SHOGUN's core is implemented in C++ and is provided as a library, libshogun, to be readily usable for C++ application developers. Its common interface functions are encapsulated in libshogunui, such that only minimal code (like setting or getting a double matrix to/from the target language) is necessary. This allowed us to easily create interfaces to Matlab(tm), R, Octave and Python. (Note that modular object-oriented and static interfaces are provided: r, octave, matlab, python, python_modular, r_modular, octave_modular, cmdline, libshogun.)
Application
We have successfully applied SHOGUN to several problems from computational biology, such as Super Family classification, Splice Site Prediction, Interpreting the SVM Classifier, Splice Form Prediction, Alternative Splicing and Promoter Prediction. Some of them come with no less than 10 million training examples, others with 7 billion test examples.
Documentation
We use Doxygen for both user and developer documentation, which may be read online. More than 600 documented examples for the interfaces python_modular, octave_modular, r_modular, static python, static matlab and octave, static r, static command line and the C++ libshogun developer interface can be found in the documentation.
 Changes to previous version:
This release contains major enhancements, cleanups and bugfixes:
Features
 Support for new languages: java, c#, ruby, lua in modular interfaces (GSoC project of Baozeng Ding)
 Port all examples to the new languages: Ruby examples with example transition tool (thanks to Justin Patera aka serialhex)
 Dimensionality reduction (manifold learning) algorithms are now available. In particular: Locally Linear Embedding (LLE), Hessian Locally Linear Embedding (HLLE), Local Tangent Space Alignment (LTSA), Kernel PCA (kPCA), Multidimensional Scaling (MDS, with possible landmark approximation), Isomap (using Fibonacci Heap Dijkstra for shortest paths), Laplacian Eigenmaps (GSoC project of Sergey Lisitsyn)
 Various new kernels: TStudentKernel, CircularKernel, WaveKernel, SplineKernel, LogKernel, RationalQuadraticKernel, WaveletKernel, BesselKernel, PowerKernel, ExponentialKernel, CauchyKernel, ANOVAKernel, MultiquadricKernel, SphericalKernel, DistantSegmentsKernel (thanks GSoC students for the contributions!)
 Streaming / Online Feature Framework for SimpleFeatures, SparseFeatures, StringFeatures (GSoC project of Shashwat Lal Das)
 SGDQN, Online SGD, Online Liblinear, Online Vowpal Wabbit (GSoC project of Shashwat Lal Das)
 Model selection framework for arbitrary Machines (GSoC project of Heiko Strathmann)
 Gaussian Mixture Models (GSoC project of Alesis Novik)
 FibonacciHeap for efficient shortest-path problem solving (thanks to Evgeniy Andreev)
 Efficient HashSet (thanks to Evgeniy Andreev)
 ARPACK wrapper (dseupd) for symmetric eigenproblems (both generalized and non-generalized), some new LAPACK wrappers (Sergey Lisitsyn)
 New Statistics module for various statistics measures (Heiko Strathmann)
 Subset support to features (Heiko Strathmann)
 Java externalization support (Sergey Lisitsyn)
 Support for Matlab 2011a.
Bugfixes
 Fix build failure with ld --as-needed (thanks Matthias Klose for the patch).
 Fix initialization error in KRR static interfaces (thanks Maxwell Collins for the patch).
Cleanup and API Changes
 Introduce Machine, KernelMachine, LinearMachine, LinearOnlineMachine, DistanceMachine with train() and apply() functions and drop Classifier.
 Restructure source code layout: Merge libshogunui and libshogun into src/shogun and move all interfaces into src/shogun. Split up lib into lib, io and mathematics.
 Create a single 'modshogun' module resembling the functionality found in libshogun. Now octave_modular and other modular interfaces work reliably.
 Introduce SGVector, SGMatrix, SGNDArray, SGStringList for transferring object pointers and metadata from/to shogun.
 Classes no longer store copies of e.g. matrices, and just pass pointers on set/get operations.
 Stop using new[] / delete[] and switch to SG_MALLOC, SG_CALLOC, SG_REALLOC, SG_FREE macros.
 Preproc renamed to preprocessor, PCACut renamed to PCA
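The train()/apply() split introduced with the Machine hierarchy can be pictured with the following sketch. It is a plain-Python analogy with simplified, assumed signatures, not SHOGUN's C++ classes: a base class fixes the two-step interface, and a subclass (here a hypothetical distance-based `NearestCentroidMachine`) fills it in.

```python
# Analogy only: class names and signatures are assumptions, not SHOGUN's API.

class Machine:
    """Base interface: learn from data with train(), predict with apply()."""
    def train(self, examples, labels):
        raise NotImplementedError

    def apply(self, examples):
        raise NotImplementedError


class NearestCentroidMachine(Machine):
    """Toy distance-based machine: predict the label of the closest class mean."""
    def train(self, examples, labels):
        sums, counts = {}, {}
        for x, y in zip(examples, labels):
            acc = sums.setdefault(y, [0.0] * len(x))
            for j, v in enumerate(x):
                acc[j] += v
            counts[y] = counts.get(y, 0) + 1
        self.centroids = {y: [s / counts[y] for s in acc]
                          for y, acc in sums.items()}
        return self

    def apply(self, examples):
        def sq_dist(a, b):
            return sum((p - q) ** 2 for p, q in zip(a, b))
        return [min(self.centroids,
                    key=lambda y: sq_dist(x, self.centroids[y]))
                for x in examples]


machine = NearestCentroidMachine().train(
    [[0.0, 0.0], [0.0, 1.0], [4.0, 4.0], [5.0, 4.0]], [-1, -1, 1, 1])
predicted = machine.apply([[0.5, 0.5], [4.5, 3.0]])
```

Code that only relies on the two base-class methods works for any kind of machine, which is what makes a generic facility such as model selection for arbitrary Machines possible.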
 Supported Operating Systems: Cygwin, Linux, Macosx
 Data Formats: Plain Ascii, Svmlight, Binary, Fasta, Fastq, Hdf
 Tags: Bioinformatics, Large Scale, String Kernel, Kernel, Kernelmachine, Lda, Lpm, Matlab, Mkl, Octave, Python, R, Svm, Sgd, Icml2010, Liblinear, Libsvm, Multiple Kernel Learning, Ocas
Comments

 Soeren Sonnenburg (on September 12, 2008, 16:14:36)
 In case you find bugs, feel free to report them at [http://trac.tuebingen.mpg.de/shogun](http://trac.tuebingen.mpg.de/shogun).

 Tom Fawcett (on January 3, 2011, 03:20:48)
 You say, "Some of them come with no less than 10 million training examples, others with 7 billion test examples." I'm not sure what this means. I have problems with mixed symbolic/numeric attributes and the training example sets don't fit in memory. Does SHOGUN require that training examples fit in memory?

 Soeren Sonnenburg (on January 14, 2011, 18:12:01)
 Shogun does not necessarily require examples to be in memory (if you use any of the FileFeatures). However, most algorithms within shogun are batch type, so using the non-in-memory FileFeatures would probably be very slow. This does not matter for doing predictions of course, even though the 7 billion test examples above referred to predicting gene starts on the whole human genome (in memory ~3.5GB, and a context window of 1200nt was shifted around in that string). In addition, one can compute features (or feature space) on the fly, potentially saving lots of memory. Not sure how big your problem is, but I guess this is better discussed on the shogun mailing list.

 Yuri Hoffmann (on September 14, 2013, 17:12:16)
 cannot use the java interface in cygwin (already reported on github) nor in debian.