mloss.org new software (http://mloss.org): updates and additions to mloss.org
Thu, 20 Apr 2017 18:32:33 -0000

DynaML 1.4.1
http://mloss.org/revision/view/2074/
<html><p>DynaML was born out of the need for a performant, extensible and easy-to-use Machine Learning research environment. Scala was a natural choice for these requirements due to its sprawling data science ecosystem (e.g. Apache Spark), its blend of functional and object-oriented programming, and its interoperability with the Java Virtual Machine.
</p>
<p>DynaML leverages a number of open source projects and builds on their useful features.
</p>
<ul>
<li>
Breeze for linear algebra operations with vectors, matrices etc.
</li>
<li>
Spire for creating algebraic entities like Fields, Groups etc.
</li>
<li>
Ammonite for the shell environment.
</li>
<li>
DynaML uses the newly minted Wisp plotting library to generate aesthetic charts of common model validation metrics. Version 1.4 also integrates Plotly, which can be imported and used directly in the shell environment.
</li>
</ul>
<h2>Modules:</h2>
<p>Core:
</p>
<p>The core API consists of:
</p>
<ul>
<li>
Model implementations
</li>
<li>
Optimisation solvers
</li>
<li>
Probability distributions/random variables
</li>
<li>
Kernel functions for non-parametric models
</li>
</ul>
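The kernel component underpins non-parametric models such as Gaussian process regression. As a language-agnostic sketch of what a kernel function computes (plain NumPy for illustration, not DynaML's Scala API), here is an RBF kernel and the Gram matrix it induces:

```python
import numpy as np

def rbf_kernel(x, y, bandwidth=1.0):
    """Squared-exponential (RBF) kernel between two points."""
    x, y = np.asarray(x), np.asarray(y)
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * bandwidth ** 2))

# Gram matrix for a toy dataset: symmetric, with ones on the diagonal.
X = [[0.0], [1.0], [2.0]]
K = np.array([[rbf_kernel(a, b) for b in X] for a in X])
```

A non-parametric model such as a Gaussian process works entirely from this Gram matrix, so swapping the kernel changes the model class without touching the rest of the code.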
<p>Data Pipes:
</p>
<p>The pipes module aims to separate data pre-processing tasks (cleaning data files, replacing missing or corrupt records, applying transformations to data, etc.) from model building:
</p>
<ul>
<li>
Ability to create arbitrary workflows from Scala functions and join them together
</li>
<li>
Feature transformations such as the wavelet transform, Gaussian scaling, auto-encoders, etc.
</li>
</ul></html>
Mandar Chandorkar. Thu, 20 Apr 2017 18:32:33 -0000
http://mloss.org/software/rss/comments/2074
http://mloss.org/revision/view/2074/
Tags: machine learning, gaussian processes

pycobra regression analysis and ensemble toolkit 0.1.0
http://mloss.org/revision/view/2073/
<html><p>pycobra is a Python library which implements the COBRA algorithm described in Biau, Fischer, Guedj and Malley [2016], "COBRA: A combined regression strategy", Journal of Multivariate Analysis.
</p>
<p>The COBRA algorithm is an aggregation-of-predictors technique which can be used to solve regression problems. pycobra also offers various visualisation and diagnostic methods, built on top of matplotlib, which let the user analyse and compare different regression machines with COBRA.
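To make the aggregation strategy concrete, here is a minimal NumPy sketch of the COBRA idea (the unanimity variant, written purely for illustration; this is not pycobra's actual API): the combined prediction at a query point averages the held-out responses of training points on which every machine's prediction lies within a tolerance eps of its prediction at the query.

```python
import numpy as np

def cobra_predict(machines, X_hold, y_hold, x, eps=0.1):
    """Average the held-out responses y_i of points whose machine
    predictions all lie within eps of the machines' predictions at x."""
    preds_x = np.array([m(x) for m in machines])
    preds_hold = np.array([[m(xi) for m in machines] for xi in X_hold])
    close = np.all(np.abs(preds_hold - preds_x) <= eps, axis=1)
    # Fall back to the global mean if no held-out point is close enough.
    return y_hold[close].mean() if close.any() else y_hold.mean()

# Toy example: two crude "machines" approximating y = 2x.
machines = [lambda x: 2.0 * x, lambda x: 2.0 * x + 0.05]
X_hold = np.array([0.0, 0.5, 1.0, 1.5])
y_hold = 2.0 * X_hold
estimate = cobra_predict(machines, X_hold, y_hold, 1.0, eps=0.2)  # 2.0
```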
</p></html>
Bhargav Srinivasa Desikan, Benjamin Guedj. Wed, 19 Apr 2017 15:04:14 -0000
http://mloss.org/software/rss/comments/2073
http://mloss.org/revision/view/2073/
Tags: regression, visualization, machine learning

MLPACK 2.2.1
http://mloss.org/revision/view/2072/
<html><p>mlpack is a scalable C++ machine learning library. Its aim is to make large-scale machine learning possible for novice users by means of a simple, consistent API, while simultaneously exploiting C++ language features to provide maximum performance and maximum flexibility for expert users.
</p>
<p>The following methods are provided:
</p>
<ul>
<li>
Approximate furthest neighbor search techniques
</li>
<li>
Collaborative Filtering (with NMF)
</li>
<li>
Decision Stumps
</li>
<li>
DBSCAN
</li>
<li>
Density Estimation Trees
</li>
<li>
Euclidean Minimum Spanning Trees
</li>
<li>
Fast Exact Max-Kernel Search (FastMKS)
</li>
<li>
Gaussian Mixture Models (GMMs)
</li>
<li>
Hidden Markov Models (HMMs)
</li>
<li>
Hoeffding trees (streaming decision trees)
</li>
<li>
Kernel Principal Components Analysis (KPCA)
</li>
<li>
K-Means Clustering
</li>
<li>
Least-Angle Regression (LARS/LASSO)
</li>
<li>
Local Coordinate Coding
</li>
<li>
Locality-Sensitive Hashing (LSH)
</li>
<li>
Logistic regression
</li>
<li>
Naive Bayes Classifier
</li>
<li>
Neighborhood Components Analysis (NCA)
</li>
<li>
Nonnegative Matrix Factorization (NMF)
</li>
<li>
Perceptron
</li>
<li>
Principal Components Analysis (PCA)
</li>
<li>
QUIC-SVD
</li>
<li>
RADICAL (ICA)
</li>
<li>
Regularized SVD
</li>
<li>
Rank-Approximate Nearest Neighbor (RANN)
</li>
<li>
Simple Least-Squares Linear Regression (and Ridge Regression)
</li>
<li>
Sparse Autoencoder
</li>
<li>
Sparse Coding
</li>
<li>
Tree-based Neighbor Search (all-k-nearest-neighbors, all-k-furthest-neighbors), using either kd-trees or cover trees
</li>
<li>
Tree-based Range Search
</li>
<li>
and more methods not listed here
</li>
</ul>
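As a language-agnostic illustration of what the tree-based neighbor search above computes (this is plain NumPy, not mlpack's C++ API; mlpack's kd-trees and cover trees accelerate exactly this brute-force computation):

```python
import numpy as np

def knn(reference, query, k):
    """Return the indices and distances of the k nearest reference
    points for each query point (brute force, Euclidean metric)."""
    diffs = reference[None, :, :] - query[:, None, :]
    dists = np.linalg.norm(diffs, axis=2)          # (n_query, n_reference)
    idx = np.argsort(dists, axis=1)[:, :k]
    return idx, np.take_along_axis(dists, idx, axis=1)

reference = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
query = np.array([[0.1, 0.1]])
neighbors, distances = knn(reference, query, k=2)  # nearest is index 0
```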
<p>Command-line executables are provided for each of these, and the C++ classes which define the methods are highly flexible, extensible, and modular. More information (including documentation, tutorials, and bug reports) is available at http://www.mlpack.org/.
</p></html>
Ryan Curtin, James Cline, Neil Slagle, Matthew Amidon, Ajinkya Kale, Bill March, Nishant Mehta, Parikshit Ram, Dongryeol Lee, Rajendran Mohan, Trironk Kiatkungwanglai, Patrick Mason, Marcus Edel, etc. Thu, 13 Apr 2017 22:25:04 -0000
http://mloss.org/software/rss/comments/2072
http://mloss.org/revision/view/2072/
Tags: gmm, hmm, machine learning, sparse, dual tree, fast, scalable, tree

Theano 0.9.0
http://mloss.org/revision/view/2070/
<html><p>Theano is a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. Theano features:
</p>
<pre><code>* tight integration with numpy – Use numpy.ndarray in Theano-compiled functions.
* transparent use of a GPU – perform data-intensive computations much faster than on a CPU.
* symbolic differentiation – Let Theano do your derivatives.
* speed and stability optimizations – Get the right answer for log(1+x) even when x is really tiny.
* dynamic C code generation – Evaluate expressions faster.
* extensive unit-testing and self-verification – Detect and diagnose many types of mistake.
</code></pre><p>Theano has been powering large-scale computationally intensive scientific investigations since 2007. But it is also approachable enough to be used in the classroom (IFT6266 at the University of Montreal).
</p>
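The stability point above is easy to see in ordinary floating point: 1 + x rounds to exactly 1 when x is tiny, so the naive formula loses the answer entirely, while a stable rewrite keeps it (Theano applies such rewrites to expressions like log(1 + x) automatically; plain NumPy is shown here for illustration):

```python
import numpy as np

x = 1e-18
naive = np.log(1.0 + x)   # 1.0 + 1e-18 rounds to 1.0, so this is exactly 0.0
stable = np.log1p(x)      # correct to machine precision
```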
<p>Theano has been used primarily to implement large-scale deep learning algorithms. To see how, see the Deep Learning Tutorials (http://www.deeplearning.net/tutorial/)
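The symbolic-differentiation idea Theano is built on can be sketched in a few lines: build an expression graph, then derive a new graph for the derivative by recursively applying the sum and product rules. A minimal illustration in plain Python (not Theano's API):

```python
class Const:
    def __init__(self, v): self.v = v
    def diff(self, wrt): return Const(0.0)
    def eval(self, env): return self.v

class Var:
    def __init__(self, name): self.name = name
    def diff(self, wrt): return Const(1.0 if self is wrt else 0.0)
    def eval(self, env): return env[self.name]

class Add:
    def __init__(self, a, b): self.a, self.b = a, b
    def diff(self, wrt): return Add(self.a.diff(wrt), self.b.diff(wrt))
    def eval(self, env): return self.a.eval(env) + self.b.eval(env)

class Mul:
    def __init__(self, a, b): self.a, self.b = a, b
    def diff(self, wrt):  # product rule, applied symbolically
        return Add(Mul(self.a.diff(wrt), self.b),
                   Mul(self.a, self.b.diff(wrt)))
    def eval(self, env): return self.a.eval(env) * self.b.eval(env)

x = Var('x')
y = Mul(x, x)                     # y = x**2, as an expression graph
dy = y.diff(x)                    # a new graph computing dy/dx = 2*x
grad_at_3 = dy.eval({'x': 3.0})   # 6.0
```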
</p></html>
mostly LISA lab. Mon, 10 Apr 2017 20:30:17 -0000
http://mloss.org/software/rss/comments/2070
http://mloss.org/revision/view/2070/
Tags: python, cuda, gpu, symbolic differentiation, numpy

Calibrated AdaMEC 1.0
http://mloss.org/revision/view/2069/
<html><p>AdaBoost is a successful and popular classification method, but it is not geared towards solving cost-sensitive classification problems, i.e. problems where the costs of different types of erroneous predictions are unequal. In our 2016 paper cited below, we reviewed all cost-sensitive variants of AdaBoost in the literature, along with our own adaptations. Below we provide code for the method that achieves the best empirical results without any need for parameter tuning, while satisfying all desirable theoretical properties. The method, 'Calibrated AdaMEC', is described in detail and motivated in the paper:
</p>
<p>Cost-sensitive boosting algorithms: Do we really need them?
Nikolaos Nikolaou, Narayanan U. Edakunni, Meelis Kull, Peter A. Flach, Gavin Brown
Machine Learning, 104(2), pages 359-384, 2016.
</p>
<p>If you make use of the code found here, please cite the above paper.
</p>
<hr />
<p>Example of use:
</p>
<p>The following example showcases how to train and generate scores and predictions under Calibrated AdaMEC. The syntax follows the conventions of the AdaBoost implementation of scikit-learn.
</p>
<p>The code below assumes the user has already split the binary classification data (classes denoted 0 and 1) into training and test sets, defined the cost of a false positive C_FP and the cost of a false negative C_FN, and selected the weak learner base_estimator and the ensemble size n_estimators.
</p>
<pre><code>from sklearn.ensemble import AdaBoostClassifier
from CalibratedAdaMEC import CalibratedAdaMECClassifier  # our Calibrated AdaMEC implementation

# Create and train an AdaBoostClassifier:
AdaBoost = AdaBoostClassifier(base_estimator, n_estimators)
AdaBoost = AdaBoost.fit(X_train, y_train)

# Create and train a CalibratedAdaMECClassifier;
# being cost-sensitive, it takes C_FP and C_FN as arguments:
CalAdaMEC = CalibratedAdaMECClassifier(base_estimator, n_estimators, C_FP, C_FN)
CalAdaMEC = CalAdaMEC.fit(X_train, y_train)

# Produce AdaBoost and Calibrated AdaMEC classifications:
labels_AdaBoost = AdaBoost.predict(X_test)
labels_CalibratedAdaMEC = CalAdaMEC.predict(X_test)

# Produce AdaBoost and Calibrated AdaMEC scores (probability estimates);
# keep only the positive class scores:
scores_AdaBoost = AdaBoost.predict_proba(X_test)[:, 1]
scores_CalibratedAdaMEC = CalAdaMEC.predict_proba(X_test)[:, 1]
</code></pre>
<hr />
<p>Examples of comparison to AdaBoost:
</p>
<p>1) Probability Estimation
</p>
<p>You can evaluate the two algorithms in terms of probability estimation using the Brier score, or the log-loss, found e.g. in the metrics module of scikit-learn. You will see that Calibrated AdaMEC achieves lower scores for both (better probability estimation).
</p>
<pre><code>from sklearn import metrics

brier_score_AdaBoost = metrics.brier_score_loss(y_test, scores_AdaBoost)
brier_score_CalibratedAdaMEC = metrics.brier_score_loss(y_test, scores_CalibratedAdaMEC)

log_loss_AdaBoost = metrics.log_loss(y_test, scores_AdaBoost)
log_loss_CalibratedAdaMEC = metrics.log_loss(y_test, scores_CalibratedAdaMEC)
</code></pre>
<p>2) Cost-sensitive Classification
</p>
<p>You can evaluate the cost-sensitive behaviour of the classifications produced by the two algorithms in terms of total cost-sensitive loss (empirical risk), as shown below. In expectation, the misclassification cost should be lower for Calibrated AdaMEC on asymmetric problems (the greater the skew, the greater the performance gain of Calibrated AdaMEC over AdaBoost).
</p>
<pre><code>import numpy as np
from sklearn import metrics

Pos = sum(y_train[np.where(y_train == 1)])  # number of positive training examples
Neg = len(y_train) - Pos                    # number of negative training examples

# Skew (combined asymmetry due to both cost and class imbalance):
skew = C_FP*Neg / (C_FN*Pos + C_FP*Neg)

# Skew-sensitive cost from each method's confusion matrix:
conf_mat_AdaBoost = metrics.confusion_matrix(y_test, labels_AdaBoost)
cost_AdaBoost = conf_mat_AdaBoost[0,1]*skew + conf_mat_AdaBoost[1,0]*(1-skew)

conf_mat_CalibratedAdaMEC = metrics.confusion_matrix(y_test, labels_CalibratedAdaMEC)
cost_CalibratedAdaMEC = conf_mat_CalibratedAdaMEC[0,1]*skew + conf_mat_CalibratedAdaMEC[1,0]*(1-skew)
</code></pre>
<hr />
<p>Looking for a more flexible implementation?
</p>
<p>The code given here uses Platt scaling (logistic sigmoid calibration), a 50%-50% train-calibration split, and the discrete version of AdaBoost.
</p>
<p>Go to https://github.com/nnikolaou/Cost-sensitive-Boosting-Tutorial for an extended IPython tutorial providing a summary of the paper and interactive code that lets you reproduce our experiments and run your own; every aspect (problem setup, calibration options, ensemble parameters, base learner parameters, evaluation measures) can be modified.
</p></html>
Nikolaos Nikolaou, Gavin Brown. Sat, 08 Apr 2017 13:57:45 -0000
http://mloss.org/software/rss/comments/2069
http://mloss.org/revision/view/2069/
Tags: ensembles, adaboost, boosting, ensemble of classifiers, ensemble methods, ensemble learning, ensemble model, calibration, class imbalance, cost sensitive, minimum expected cost, risk minimization

KeLP 2.2.0
http://mloss.org/revision/view/2068/
<html><p>Many applications in information and computer technology domains deal with structured data.
For example, in Natural Language Processing (NLP) sentences are typically represented as syntactic parse trees, while in biology chemical compounds can be represented as undirected graphs.
In contrast, most Machine Learning (ML) methods and toolkits represent data as feature vectors, whose definition and computation are typically costly, especially in the case of structured data. For example, the number of times a substructure appears in a structure can be an important feature. However, the number of substructures in a tree grows exponentially with its number of nodes, leading to an exponential number of structural features, which thus cannot be fully exploited in practice.
A solution to the above-mentioned problem is given by Kernel Methods applied with kernel machines, e.g., SVMs or online learning models.
The Kernel-based Learning Platform is a Java framework that aims to facilitate kernel-based learning, in particular on structural data. It contains the implementation of several kernel machines as well as kernel functions, enabling an easy and agile definition of new methods over generic data representations, e.g., vectorial data or discrete structures, such as trees and strings. The framework has been designed to decouple kernel functions and learning algorithms thanks to the definition of specific interfaces. Once a new kernel function is implemented, it can be immediately used in all available kernel-machines, which include different online and batch algorithms for <em>Classification</em>, <em>Regression</em> and <em>Clustering</em>.
The library is highly interoperable: data objects, kernel functions and algorithms are serializable in <em>XML</em> and <em>JSON</em>, enabling the agile definition of kernel-based learning systems. Additionally, such engineering choice allows for defining kernel and algorithm combinations by simply changing parameters in the <em>XML</em> and <em>JSON</em> files (without the need of writing new code).
</p>
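The decoupling described above can be illustrated with the simplest kernel machine: a kernel perceptron whose update rule touches the data only through a pluggable kernel function, so swapping kernels requires no change to the learning algorithm. A minimal Python sketch for illustration (KeLP itself is a Java framework with its own interfaces):

```python
import numpy as np

def poly_kernel(x, z, degree=2):
    """Polynomial kernel; any other kernel can be plugged in unchanged."""
    return (1.0 + np.dot(x, z)) ** degree

def kernel_perceptron(X, y, kernel, epochs=10):
    """Online kernel perceptron: the algorithm sees the data only
    through kernel(x, z), decoupling it from the representation."""
    alpha = np.zeros(len(X))
    for _ in range(epochs):
        for i in range(len(X)):
            f = sum(alpha[j] * y[j] * kernel(X[j], X[i]) for j in range(len(X)))
            if y[i] * f <= 0:          # mistake-driven update
                alpha[i] += 1.0
    return alpha

# XOR is not linearly separable, but a degree-2 polynomial kernel separates it.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([-1.0, 1.0, 1.0, -1.0])
alpha = kernel_perceptron(X, y, poly_kernel)
preds = [np.sign(sum(alpha[j] * y[j] * poly_kernel(X[j], x) for j in range(len(X))))
         for x in X]                   # recovers [-1, 1, 1, -1]
```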
<p>Some available <strong>kernels</strong>:
</p>
<ul>
<li><p><em>Tree Kernels</em>: SubTreeKernel, SubSetTreeKernel, PartialTreeKernel, SmoothedPartialTreeKernel, CompositionallySmoothedPartialTreeKernel
</p>
</li>
<li><p><em>Graph Kernels</em>: ShortestPathKernel, Weisfeiler-Lehman Subtree Kernel for Graphs
</p>
</li>
<li><p><em>SequenceKernel</em>
</p>
</li>
<li><p><em>PreferenceKernel</em> and other kernels defined over pairs
</p>
</li>
<li><p><em>Standard Kernels</em>: LinearKernel, PolynomialKernel, RBFKernel, NormalizationKernel, LinearKernelCombination, KernelMultiplication
</p>
</li>
</ul>
<p>Some available <strong>algorithms</strong>:
</p>
<ul>
<li><p><em>Batch Learning</em>: OneClassSVM, C-SVM, nu-SVM, LinearSVM, LinearSVMRegression, epsilon-regression, Dual Coordinate Descent
</p>
</li>
<li><p><em>Online Learning</em>: Perceptron, PassiveAggressive, BudgetedPassiveAggressive, Stoptron, RandomizedPerceptronOnBudget, SoftConfidenceWeightedClassification
</p>
</li>
<li><p><em>Clustering</em>: KernelizedKMean
</p>
</li>
</ul>
<p>NEWS: using KeLP, our group won the <a href="http://alt.qcri.org/semeval2017/task3/">SemEval 2017 Task 3 challenge on Community Question Answering</a> and the <a href="http://alt.qcri.org/semeval2016/task3/">SemEval 2016 Task 3 challenge on Community Question Answering</a>.
</p></html>
Simone Filice, Giuseppe Castellucci, Danilo Croce, Roberto Basili, Giovanni Da San Martino, Alessandro Moschitti. Fri, 07 Apr 2017 16:51:42 -0000
http://mloss.org/software/rss/comments/2068
http://mloss.org/revision/view/2068/
Tags: svm, classification, clustering, regression, kernels, online learning, kernel methods, graph kernels, structured data, linear models, tree kernels

r-cran-CoxBoost 1.4
http://mloss.org/revision/view/1313/
<html><p>Cox models by likelihood-based boosting for a single survival endpoint or competing risks: this package provides routines for fitting Cox models by likelihood-based boosting for a single endpoint or in the presence of competing risks.
</p></html>
Harald Binder. Sat, 01 Apr 2017 00:00:04 -0000
http://mloss.org/software/rss/comments/1313
http://mloss.org/revision/view/1313/
Tags: r-cran

r-cran-e1071 1.6-8
http://mloss.org/revision/view/2061/
<html><p>Misc Functions of the Department of Statistics, Probability Theory Group (Formerly: E1071), TU Wien: Functions for latent class analysis, short time Fourier transform, fuzzy clustering, support vector machines, shortest path computation, bagged clustering, naive Bayes classifier, ...
</p></html>
David Meyer [aut, cre], Evgenia Dimitriadou [aut, cph], Kurt Hornik [aut], Andreas Weingessel [aut], Friedrich Leisch [aut], Chih-Chung Chang [ctb, cph] (libsvm C++-code), Chih-Chen Lin [ctb, cph] (li
Sat, 01 Apr 2017 00:00:04 -0000
http://mloss.org/software/rss/comments/2061
http://mloss.org/revision/view/2061/
Tags: r-cran

r-cran-Boruta 5.2.0
http://mloss.org/revision/view/2053/
<html><p>Wrapper Algorithm for All Relevant Feature Selection: An all relevant feature selection wrapper algorithm. It finds relevant features by comparing original attributes' importance with importance achievable at random, estimated using their permuted copies.
</p></html>
Miron Bartosz Kursa [aut, cre], Witold Remigiusz Rudnicki [aut]. Sat, 01 Apr 2017 00:00:03 -0000
http://mloss.org/software/rss/comments/2053
http://mloss.org/revision/view/2053/
Tags: r-cran

r-cran-CORElearn 1.50.3
http://mloss.org/revision/view/2067/
<html><p>Classification, Regression and Feature Evaluation: a suite of machine learning algorithms written in C++ with an R interface, containing several learning techniques for classification and regression. Predictive models include, e.g., classification and regression trees with optional constructive induction and models in the leaves, random forests, kNN, naive Bayes, and locally weighted regression. All predictions obtained with these models can be explained and visualized with the ExplainPrediction package. The package is especially strong in feature evaluation, where it contains several variants of the Relief algorithm and many impurity-based attribute evaluation functions, e.g., Gini, information gain, MDL, and DKM. These methods can be used for feature selection or discretization of numeric attributes. The OrdEval algorithm and its visualization are used for evaluation of data sets with ordinal features and class, enabling analysis according to the Kano model of customer satisfaction. Several algorithms support parallel multithreaded execution via OpenMP. The top-level documentation is reachable through ?CORElearn.
</p></html>
Marko Robnik-Sikonja, Petr Savicky. Tue, 28 Mar 2017 00:00:00 -0000
http://mloss.org/software/rss/comments/2067
http://mloss.org/revision/view/2067/
Tags: r-cran