- Description:
Overview
The SHOGUN machine learning toolbox focuses on large-scale kernel methods, especially Support Vector Machines (SVM). It comes with a generic interface for kernel machines and features 15 different SVM implementations that all access features in a unified way: via a general kernel framework or, in the case of linear SVMs, via so-called "DotFeatures", i.e., features providing a minimalistic set of operations (like the dot product).
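To make the "DotFeatures" idea concrete, here is a minimal sketch in plain Python. It is not SHOGUN's API: the class names, method names and the perceptron trainer are illustrative assumptions. The point it shows is that a linear learner needs only a dot product and an "add to weight vector" operation, so dense and sparse data can share one training routine.

```python
# Hypothetical sketch: class and method names are illustrative, not SHOGUN's API.

class DenseDotFeatures:
    """Dense vectors stored as lists of floats."""

    def __init__(self, vectors):
        self.vectors = vectors
        self.dim = len(vectors[0])

    def __len__(self):
        return len(self.vectors)

    def dense_dot(self, i, w):
        # Dot product of example i with a dense weight vector w.
        return sum(x * wj for x, wj in zip(self.vectors[i], w))

    def add_to_dense_vec(self, alpha, i, w):
        # In-place update w += alpha * x_i.
        for j, x in enumerate(self.vectors[i]):
            w[j] += alpha * x


class SparseDotFeatures:
    """Sparse vectors stored as {index: value} dictionaries."""

    def __init__(self, vectors, dim):
        self.vectors = vectors
        self.dim = dim

    def __len__(self):
        return len(self.vectors)

    def dense_dot(self, i, w):
        return sum(v * w[j] for j, v in self.vectors[i].items())

    def add_to_dense_vec(self, alpha, i, w):
        for j, v in self.vectors[i].items():
            w[j] += alpha * v


def train_perceptron(features, labels, epochs=10):
    """Perceptron that touches the data only through the two operations above."""
    w = [0.0] * features.dim
    for _ in range(epochs):
        for i in range(len(features)):
            if labels[i] * features.dense_dot(i, w) <= 0:
                features.add_to_dense_vec(labels[i], i, w)
    return w


dense = DenseDotFeatures([[1.0, 0.0], [0.0, 1.0]])
sparse = SparseDotFeatures([{0: 1.0}, {1: 1.0}], dim=2)
labels = [1, -1]
w_dense = train_perceptron(dense, labels)
w_sparse = train_perceptron(sparse, labels)
```

Both calls run the same trainer and arrive at the same weight vector, although the two feature classes store their data differently.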
Features
SHOGUN includes the LinAdd accelerations for string kernels and the COFFIN framework for on-demand computing of features for the contained linear SVMs. In addition it contains more advanced Multiple Kernel Learning, Multi Task Learning and Structured Output learning algorithms and other linear methods. SHOGUN digests input feature-objects of basically any known type, e.g., dense, sparse or variable length features (strings) of any type char/byte/word/int/long int/float/double/long double.
The toolbox provides efficient implementations of 35 different kernels, among them the
- Linear,
- Polynomial,
- Gaussian and
- Sigmoid Kernel
and also provides a number of recent string kernels like the
- Locality Improved,
- Fisher,
- TOP,
- Spectrum,
- Weighted Degree Kernel (with shifts).
For the latter, the efficient LINADD optimizations are implemented. SHOGUN also offers the freedom of working with custom pre-computed kernels. One of its key features is the combined kernel, which can be constructed as a weighted linear combination of a number of sub-kernels, each of which need not work on the same domain. An optimal sub-kernel weighting can be learned using Multiple Kernel Learning. Currently, SVM one-class, 2-class and multi-class classification as well as regression problems are supported. SHOGUN also implements a number of linear methods like
- Linear Discriminant Analysis (LDA),
- Linear Programming Machine (LPM) and
- Perceptrons,
and features algorithms to train Hidden Markov Models.
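The combined kernel described above can be sketched in a few lines of plain Python. This is an illustration, not SHOGUN code: the function names, the Gaussian width convention and the fixed weights are assumptions, and the spectrum kernel here is the simple k-mer count version. Each example is a pair of a real-valued vector and a string, and each sub-kernel looks only at its own domain.

```python
import math
from collections import Counter


def gaussian_kernel(x, y, width=1.0):
    # exp(-||x - y||^2 / width); this width convention is an assumption.
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / width)


def spectrum_kernel(s, t, k=2):
    # Sum over all k-mers u of count_s(u) * count_t(u).
    cs = Counter(s[i:i + k] for i in range(len(s) - k + 1))
    ct = Counter(t[i:i + k] for i in range(len(t) - k + 1))
    return float(sum(n * ct[u] for u, n in cs.items()))


def combined_kernel(a, b, weights):
    # Weighted linear combination of sub-kernels living on different domains.
    (xa, sa), (xb, sb) = a, b
    return weights[0] * gaussian_kernel(xa, xb) + weights[1] * spectrum_kernel(sa, sb)


a = ([0.0, 1.0], "ACGT")
b = ([0.0, 1.0], "ACGA")
value = combined_kernel(a, b, weights=[0.5, 0.5])
```

Here the weights are fixed by hand; Multiple Kernel Learning would instead learn them from the training data.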
The input feature-objects can be read from plain ASCII files (tab-separated values for dense matrices; libsvm/svmlight format for sparse matrices), from an efficient native binary format and, with general support, from the HDF5-based format, supporting
- dense
- sparse or
- strings of various types
that can often be converted between each other. Chains of preprocessors (e.g. subtracting the mean) can be attached to each feature object allowing for on-the-fly pre-processing.
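A chain of preprocessors applied on the fly might look like the following sketch. The class names and the `fit`/`apply` split are hypothetical choices made for this illustration, not SHOGUN's interface. The stored data is never modified; the chain runs each time a feature vector is requested.

```python
# Hypothetical sketch: names and structure are illustrative, not SHOGUN's API.

class SubtractMean:
    """Learns the per-dimension mean and removes it from each vector."""

    def fit(self, vectors):
        n = len(vectors)
        self.mean = [sum(col) / n for col in zip(*vectors)]
        return self

    def apply(self, x):
        return [xi - m for xi, m in zip(x, self.mean)]


class NormOne:
    """Scales a vector to unit Euclidean length (zero vectors are left as they are)."""

    def fit(self, vectors):
        return self

    def apply(self, x):
        norm = sum(xi * xi for xi in x) ** 0.5
        return x if norm == 0 else [xi / norm for xi in x]


class Features:
    def __init__(self, vectors):
        self.vectors = vectors
        self.chain = []

    def add_preprocessor(self, preproc):
        # Fit on the data as seen through the preprocessors already attached.
        preproc.fit([self.get_feature_vector(i) for i in range(len(self.vectors))])
        self.chain.append(preproc)

    def get_feature_vector(self, i):
        # The raw data stays untouched; the chain runs on every access.
        x = self.vectors[i]
        for preproc in self.chain:
            x = preproc.apply(x)
        return x


feats = Features([[1.0, 2.0], [3.0, 2.0]])
feats.add_preprocessor(SubtractMean())
feats.add_preprocessor(NormOne())
first = feats.get_feature_vector(0)
```

Applying the chain on access trades some repeated computation for memory, since no preprocessed copy of the data set has to be kept.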
Structure and Interfaces
SHOGUN's core is implemented in C++ and is provided as a library, libshogun, readily usable by C++ application developers. Its common interface functions are encapsulated in libshogunui, such that only minimal code (like setting or getting a double matrix to/from the target language) is necessary. This allowed us to easily create interfaces to Matlab(tm), R, Octave and Python. (Note that both modular object-oriented and static interfaces are provided: r, octave, matlab, python, python_modular, r_modular, octave_modular, cmdline, libshogun.)
Application
We have successfully applied SHOGUN to several problems from computational biology, such as Super Family classification, Splice Site Prediction, Interpreting the SVM Classifier, Splice Form Prediction, Alternative Splicing and Promoter Prediction. Some of them come with no less than 10 million training examples, others with 7 billion test examples.
Documentation
We use Doxygen for both user and developer documentation, which may be read online. More than 600 documented examples for the interfaces python_modular, octave_modular, r_modular, static python, static matlab and octave, static r, static command line and the C++ libshogun developer interface can be found in the documentation.
- Changes to previous version:
This release contains major enhancements, cleanups and bugfixes:
Features
- New dimensionality reduction algorithms: Diffusion Maps, Kernel Locally Linear Embedding, Kernel Local Tangent Space Alignment, Linear Local Tangent Space Alignment, Neighborhood Preserving Embedding, Locality Preserving Projections.
- Various performance improvements for dimensionality reduction methods (BLAS, alignment formulation of the LLE, ...)
- Automatic k determination mode for the Locally Linear Embedding dimension reduction method, based on reconstruction error.
- ARPACK and SUPERLU integration.
- Introduce the concept of Converters that can embed (arbitrary) feature types into different feature types.
- LibSVM is now pthread-parallelized.
- Create modshogun.dll for csharp.
- Various new c# examples (thanks Daniel Korn).
- A dimensionality reduction examples application is introduced.
Bugfixes
- Octave_static and octave_modular examples fix.
- Memory leak in custom kernel is now eliminated (thanks Madeleine Seeland for reporting).
- Fix for linear machine set_w method (thanks Brian Cheung for reporting).
- DotFeatures fix for assert bug.
- FibonacciHeap memory leak fix.
- Fix for Java modular interface typemapping bug.
- Fix errors uncovered by LLVM / clang++.
- Fix for configure on Darwin-x86_64 (thanks Peter Romov for patch).
- Improve lua / ruby detection.
- Fix configure / compilation under osx and cygwin for various interfaces.
Cleanup and API Changes
- Most of the inline functions have been (re)moved to the corresponding .cpp file
- Libshogun is now compiled with SSE support for math (if available), while the interfaces are now compiled with the -O0 flag, which drastically reduces compilation time.
- Supported Operating Systems: Cygwin, Linux, Macosx
- Data Formats: Plain Ascii, Svmlight, Binary, Fasta, Fastq, Hdf
- Tags: Bioinformatics, Large Scale, String Kernel, Kernel, Kernelmachine, Lda, Lpm, Matlab, Mkl, Octave, Python, R, Svm, Sgd, Icml2010, Liblinear, Libsvm, Multiple Kernel Learning, Ocas
Other available revisions
Version, Changelog, Date
Version 1.1.0: This release contains major enhancements, cleanups and bugfixes; the full changelog is listed under "Changes to previous version" above.
Date (1.1.0): December 13, 2011, 05:11:29
Version 1.0.0: This release contains major enhancements, cleanups and bugfixes:
Features
- Support for new languages: java, c#, ruby, lua in modular interfaces (GSoC project of Baozeng Ding)
- Port all examples to the new languages: Ruby examples with example transition tool (thanks to Justin Patera aka serialhex)
- Dimensionality reduction (manifold learning) algorithms are now available. In particular: Locally Linear Embedding (LLE), Hessian Locally Linear Embedding (HLLE), Local Tangent Space Alignment (LTSA), Kernel PCA (kPCA), Multidimensional Scaling (MDS, with possible landmark approximation), Isomap (using Fibonacci Heap Dijkstra for shortest paths), Laplacian Eigenmaps (GSoC project of Sergey Lisitsyn)
- Various new kernels: TStudentKernel, CircularKernel, WaveKernel, SplineKernel, LogKernel, RationalQuadraticKernel, WaveletKernel, BesselKernel, PowerKernel, ExponentialKernel, CauchyKernel, ANOVAKernel, MultiquadricKernel, SphericalKernel, DistantSegmentsKernel (thanks GSoC students for the contributions!)
- Streaming / Online Feature Framework for SimpleFeatures, SparseFeatures, StringFeatures (GSoC project of Shashwat Lal Das)
- SGD-QN, Online SGD, Online Liblinear, Online Vowpal Wabbit (GSoC project of Shashwat Lal Das)
- Model selection framework for arbitrary Machines (GSoC project of Heiko Strathmann)
- Gaussian Mixture Models (GSoC project of Alesis Novik)
- FibonacciHeap for efficient shortest-path problem solving (thanks to Evgeniy Andreev)
- Efficient HashSet (thanks to Evgeniy Andreev)
- ARPACK wrapper (dseupd) for symmetric eigenproblems (both generalized and non-generalized), some new LAPACK wrappers (Sergey Lisitsyn)
- New Statistics module for various statistics measures (Heiko Strathmann)
- Subset support to features (Heiko Strathmann)
- Java externalization support (Sergey Lisitsyn)
- Support matlab 2011a.
Bugfixes
- Fix build failure with ld --as-needed (thanks Matthias Klose for the patch).
- Fix initialization error in KRR static interfaces (thanks Maxwell Collins for the patch).
Cleanup and API Changes
- Introduce Machine, KernelMachine, LinearMachine, LinearOnlineMachine, DistanceMachine with train() and apply() functions and drop Classifier.
- Restructure source code layout: Merge libshogunui and libshogun into src/shogun and move all interfaces into src/shogun. Split up lib into lib, io and mathematics.
- Create a single 'modshogun' module resembling the functionality found in libshogun. Now octave_modular and other modular interfaces work reliably.
- Introduce SGVector, SGMatrix, SGNDArray, SGStringList for transferring object-pointers and meta-data from/to shogun.
- Classes no longer store copies of e.g. matrices, and just pass pointers on set/get operations.
- Stop using new[] / delete[] and switch to SG_MALLOC, SG_CALLOC, SG_REALLOC, SG_FREE macros.
- Preproc renamed to preprocessor, PCACut renamed to PCA
Date (1.0.0): September 1, 2011, 02:09:45
Version 0.10.0: This release contains several enhancements, cleanups and bugfixes:
Features
- Serialization of objects deriving from CSGObject, i.e. all shogun objects (SVM, Kernel, Features, Preprocessors, ...) as ASCII, JSON, XML and HDF5
- Create SVMLightOneClass
- Add CustomDistance in analogy to custom kernel
- Add HistogramIntersectionKernel (thanks Koen van de Sande for the patch)
- Matlab 2010a support
- SpectrumMismatchRBFKernel modular support (thanks Rob Patro for the patch)
- Add ZeroMeanCenterKernelNormalizer (thanks Gorden Jemwa for the patch)
- Swig 2.0 support
Bugfixes
- Custom Kernels can now be > 4G (thanks Koen van de Sande for the patch)
- Set C locale on startup in init_shogun to prevent incompatibilities with ascii floats and fprintf
- Compile fix when reference counting is disabled
- Fix set_position_weights for wd kernel (reported by Dave duVerle)
- Fix set_wd_weights for wd kernel.
- Fix crasher in SVMOcas (reported by Yaroslav)
Cleanup and API Changes
- Renamed SVM_light/SVR_light to SVMLight etc.
- Remove C prefix in front of non-serializable class names
- Drop CSimpleKernel and introduce CDotKernel as its base class. This way all dot-product based kernels can be applied on top of DotFeatures and only a single implementation for such kernels is needed.
Date (0.10.0): December 7, 2010, 15:35:26
Version 0.9.3: This release contains several enhancements, cleanups and bugfixes:
Features
- Experimental lp-norm MCMKL
- New Kernels: SpectrumRBFKernelRBF, SpectrumMismatchRBFKernel, WeightedDegreeRBFKernel
- WDK kernel supports amino acids
- String Features now support append operations
- python-dbg support
- Allow floats as input for custom kernel (and matrices > 4GB in size)
Bugfixes
- Static linking fix.
- Fix sparse linear kernel's add_to_normal
Cleanup and API Changes
- Remove init() function in Performance Measures
- Adjust .so suffix for python and use python distutils to figure out install paths
Date (0.9.3): May 31, 2010, 15:31:49
Version 0.9.2: This release contains several enhancements, cleanups and bugfixes:
Features
- Direct reading and writing of ASCII/Binary files/HDF5 based files.
- Implemented multi task kernel normalizer.
- Implement SNP kernel.
- Implement time limit for libsvm/libsvr.
- Integrate Elastic Net MKL (thanks Ryota Tomioka for the patch).
- Implement Hashed WD Features.
- Implement Hashed Sparse Poly Features.
- Integrate liblinear 1.51
- LibSVM can now be trained with bias disabled.
- Add functions to set/get global and local io/parallel/... objects.
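The hashed feature types in the list above build on the general hashing trick: features from a huge implicit space (for example all k-mers of a string) are mapped into a fixed number of buckets. The sketch below shows that generic idea only. It is not SHOGUN's Hashed WD or Hashed Sparse Poly implementation, and the function names, the use of `zlib.crc32` as hash and the bucket count are assumptions made for illustration.

```python
import zlib


def hashed_kmer_features(sequence, k, num_bits):
    """Map every k-mer of `sequence` to one of 2**num_bits buckets and count hits."""
    dim = 1 << num_bits
    vec = {}
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        # crc32 is used only because it is deterministic and in the standard library.
        bucket = zlib.crc32(kmer.encode("ascii")) % dim
        vec[bucket] = vec.get(bucket, 0) + 1
    return vec


def sparse_dot(u, v):
    # Iterate over the shorter of the two sparse vectors.
    if len(u) > len(v):
        u, v = v, u
    return sum(val * v.get(idx, 0) for idx, val in u.items())


a = hashed_kmer_features("ACGTACGT", k=3, num_bits=10)
b = hashed_kmer_features("ACGTTTTT", k=3, num_bits=10)
similarity = sparse_dot(a, b)
```

The feature dimension is fixed at 2**num_bits regardless of k or alphabet size; the price is that distinct k-mers can collide in one bucket, which can only increase the dot product of these count vectors.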
Bugfixes
- Fix set_w() for linear classifiers.
- Static Octave, Python, Cmdline and Modular Python interfaces compile cleanly under Windows/Cygwin again.
- In static interfaces testing could fail when not directly done after training.
Date (0.9.2): March 31, 2010, 00:50:12
Version 0.9.1: This release contains several enhancements, cleanups and bugfixes:
Features
- Integrate LaRank.
- Memory Mapped Features (for data sets that don't fit into memory).
- Compressor module with compression and decompression support for lzo, gzip, bzip2 and lzma.
- Compressed String Features with on-the-fly decompression (CDecompressString preproc).
- Parallel computation of get_kernel_matrix().
- One may now prefix all shogun print/outputs with file name and line number (obj.io.enable_file_and_line())
- Chinese Documentation thanks Elpmis Lee.
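The idea behind compressed string features with on-the-fly decompression, from the list above, can be sketched with Python's standard `zlib` module. This is an illustration under assumptions: the class and its methods are hypothetical and do not mirror SHOGUN's Compressor module or CDecompressString, and zlib stands in for the lzo, gzip, bzip2 and lzma codecs named in the changelog.

```python
import zlib


class CompressedStringFeatures:
    """Keeps strings compressed in memory and inflates one only when it is read."""

    def __init__(self, strings):
        self.blobs = [zlib.compress(s.encode("ascii")) for s in strings]

    def __len__(self):
        return len(self.blobs)

    def get_feature_vector(self, i):
        # Decompression happens here, per access, not up front.
        return zlib.decompress(self.blobs[i]).decode("ascii")

    def compressed_size(self):
        return sum(len(b) for b in self.blobs)


genome_like = ["ACGT" * 1000, "TTGA" * 1000]
feats = CompressedStringFeatures(genome_like)
second = feats.get_feature_vector(1)
```

Highly repetitive strings such as these compress very well, so the stored size is far below the 8000 characters of raw input, while each access still returns the original string.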
Bugfixes
- Fix One class MKL testing in static interfaces.
- Configure fixes: Let octave not write history on configure; fail when cplex is forcefully enabled but not found; add cplex 12 support.
- Fix a problem with regression and CombinedKernels employing only Custom kernels.
Cleanup and API Changes
- String Features now (like SimpleFeatures) upon get_feature_vector require an additional do_free argument and need to be freed using free_feature_vector.
Date (0.9.1): November 16, 2009, 11:02:41
Version 0.9.0: This release contains several cleanups and enhancements:
Features
- Implement set_linear_classifier for static interfaces.
- Implement Polynomial DotFeatures.
- Implement domain adaptation SVM.
- Speed up ScatterSVM.
- Initial implementation for saving and Loading of shogun objects.
- Examples have been polished/split up into separate files.
- Documentation and webpage improvements.
Bugfixes
- Fix one class MKL for static interfaces.
- Fix performance measures integer overflow.
- Configure fixes to run under OSX's snow leopard.
- Compiles and runs under solaris both using suncc and gcc.
Cleanup and API Changes
- It is no longer necessary to call init_kernel TRAIN/TEST.
- Removed kernel {load,save}_init.
- Removed preproc {load,save}_init.
- Move the mkl code from classifier/svm to classifier/mkl.
- Removed obsolete mindy support.
- Rename MCSVM to ScatterSVM
- Move distributions to distributions/ directory.
- CClassifier::classify() no longer has a label as argument.
- Introduce CClassifier::train(CFeatures * ) and classify(CFeatures *) for more effective training/testing.
- Remove unnecessary global symbols.
Date (0.9.0): October 23, 2009, 14:23:21
Version 0.8.0: This release contains several cleanups, features and bugfixes:
Features
- Implements new multiclass svm formulation.
- 1,2 and general q-norm MKL for classification, regression and one-class for wrapper and chunking algorithm for arbitrary (dual) SVM solvers.
- Dynamic Programming code is now accessible from python.
- Implements Regulatory Modules kernel.
- Documentation updates (Tutorial, improved installation instructions, overview about the implemented algorithms).
Bugfixes
- Correct q-norm MKL for Newton.
- Upon make install of elwms don't install files into R/octave/python if these interfaces were not configured
- Svm-nu parameter was not set correctly.
- Fix custom kernel initialization.
- Correct get_subkernel_weights.
- Proper Intel core2 compile flags detection
- Fix number of outputs for KNN.
- Run tests with proper LD_LIBRARY_PATH set.
- Fix several memory leaks.
Cleanup and API Changes
- Rename svm_one_class_nu to svm_nu.
- Clean up dynamic programming code.
- Remove commands from_position_list and slide_window and move functionality into set/add_features.
- Remove now obsolete legacy examples.
Date (0.8.0): August 16, 2009, 19:53:50
Version 0.7.3: This release contains several cleanups and bugfixes:
Features
- Improve libshogun/developer tutorial.
- Implement convenience function for parallel quicksort.
- Fasta/fastq file loading for StringFeatures.
Bugfixes
- get_name function was undefined in Evaluation causing the PerformanceMeasures class to be defunct.
- Workaround bugs in the std template library for math functions.
- Compiles cleanly under OSX now, thanks to James Kyle.
Cleanup and API Changes
- Make sure that all destructors are declared virtual.
Date (0.7.3): May 2, 2009, 22:45:13
Version 0.7.2: This release contains several cleanups and enhancements:
Features:
- Support all data types from python_modular: dense, scipy-sparse csc_sparse matrices and strings of type bool, char, (u)int{8,16,32,64}, float{32,64,96}. In addition, individual vectors/strings can now be obtained and even changed. See examples/python_modular/features_*.py for examples.
- AUC maximization now works with arbitrary kernel SVMs.
- Documentation updates, many examples have been polished.
- Slightly speed up the Oligo kernel.
Bugfixes:
- Fix reading strings from directory (f.load_from_directory()).
- Update copyright to 2009.
Cleanup and API Changes:
- Remove {Char,Short,Word,Int,Real}Features and only ever use the templated SimpleFeatures.
- Split up examples in examples/python_modular to separate files.
- Now use s.set_features(strs) instead of s.set_string_features(strs) to set string features.
- The meaning of the width parameter for the Oligo Kernel changed, the OligoKernel has been renamed to OligoStringKernel.
Date (0.7.2): March 23, 2009, 10:23:04
Version 0.7.1: This release contains several cleanups, feature enhancements and bugfixes:
Features:
- configure now detects libshogun/ui installed in /usr/(local/)lib if libshogun/ui dirs are removed.
- Improved documentation (and path and doxygen fixes).
- Tutorial on how to develop with libshogun and to extend shogun.
- Added the elwms (eierlegendewollmilchsau) interface, a chimera that interfaces to python, octave, r and matlab in one file. It provides the run_{octave,python,r} commands to run code in {octave,python,r} from within octave, r, matlab and python, transparently making variables available to the target interface and avoiding file i/o.
- Implement AttributeFeatures for (attr,value) pairs, trees etc.
Bugfixes:
- fix a crasher occurring with combined kernel and multiple threads.
- configure now allows building of modular interfaces only.
- n-dimensional arrays work now in octave.
Cleanup and API Changes:
- Custom Kernel no longer requires features nor initialization, even not when used in CombinedKernel (the combined kernel will skip over custom kernels on init).
Date (0.7.1): March 8, 2009, 20:30:32
Version 0.7.0: This release contains major feature enhancements and bugfixes:
- Implement DotFeatures and CombinedDotFeatures. DotFeatures need to provide dot-product and similar operations (hence the name). This enables training of linear methods with mixed datatypes (sparse, dense and others, even the newly implemented string-based SpecFeatures and WDFeatures).
- MKL now does not require CPLEX any longer.
- Add q-norm MKL support based on internal Newton implementation.
- Add 1-norm MKL support based on GLPK.
- Add multiclass MKL support based on the GLPK and the GMNP svm solver.
- Implement Tensor Product Pair Kernel (TPPK).
- Support compilation on the iPhone :)
- Add an option to set wds kernel position weights.
- Build static libshogun.a for libshogun target.
- Testsuite can also test the modular R interface, added test for OligoKernel.
- Ocas and WDOcas can be used with a bias feature now.
- Update to LibSVM 2.88.
- Enable parallelized HMM code by default.
Date (0.7.0): February 20, 2009, 10:41:46
Version 0.6.7: Initial Announcement on mloss.org.
Date (0.6.7): October 11, 2007, 21:45:32
Comments
- Tom Fawcett (on January 3, 2011, 03:20:48)
You say, "Some of them come with no less than 10 million training examples, others with 7 billion test examples." I'm not sure what this means. I have problems with mixed symbolic/numeric attributes and the training example sets don't fit in memory. Does SHOGUN require that training examples fit in memory?
- Soeren Sonnenburg (on January 14, 2011, 18:12:01)
Shogun does not necessarily require examples to be in memory (if you use any of the FileFeatures). However, most algorithms within shogun are batch type - so using the non in-memory FileFeatures would probably be very slow.
This does not matter for doing predictions of course, even though the 7 billion test examples above referred to predicting gene starts on the whole human genome (in memory ~3.5GB and a context window of 1200nt was shifted around in that string).
In addition one can compute features (or feature space) on-the-fly potentially saving lots of memory.
Not sure how big your problem is, but I guess this is better discussed on the shogun mailing list.
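The sliding-window prediction described in the reply above (one long string in memory, a fixed-length context window shifted across it) can be sketched as follows. This is a toy illustration, not the actual gene start predictor: the function names are hypothetical and a GC-content score stands in for a trained classifier.

```python
def sliding_windows(sequence, window, step=1):
    """Yield (start, window_string) pairs one at a time instead of storing them all."""
    for start in range(0, len(sequence) - window + 1, step):
        yield start, sequence[start:start + window]


def predict_positions(sequence, window, score, threshold):
    """Score each window on the fly and keep the start positions above threshold."""
    return [start for start, w in sliding_windows(sequence, window) if score(w) > threshold]


def gc_content(w):
    # Toy stand-in for a trained classifier: fraction of G/C in the window.
    return sum(c in "GC" for c in w) / len(w)


hits = predict_positions("ATATGCGCGCATAT", window=4, score=gc_content, threshold=0.9)
```

Only the sequence itself is held in memory; each test example exists just long enough to be scored, which is why the number of test examples can far exceed what would fit in memory as an explicit feature matrix.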
In case you find bugs, feel free to report them at http://trac.tuebingen.mpg.de/shogun.