Projects supporting the arff data format.


JMLR MultiBoost 1.2.00

by busarobi - April 22, 2013, 15:42:53 CET [ Project Homepage BibTeX BibTeX for corresponding Paper Download ] 14273 views, 2492 downloads, 1 subscription

About: MultiBoost is a multi-purpose boosting package implemented in C++. It is based on the multi-class/multi-task AdaBoost.MH algorithm [Schapire-Singer, 1999]. Basic base learners (stumps, trees, products, Haar filters for image processing) can be easily complemented by new data representations and the corresponding base learners, without interfering with the main boosting engine.

Changes:
  • A new fast (sublinear in the number of instances) stump algorithm is implemented. The gain in time is proportional to the sparsity of the features (it is significant when a lot of instances take the most frequent feature value). See Section B.2 in the documentation.
  • A parametrized early stopping option is added in --traintest mode. We stop if the (smoothed) test error does not improve for a certain number of iterations. See Section 4.1.3 in the documentation.
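
The early-stopping rule described in the changes above can be sketched as follows. This is a minimal illustration in Python, not MultiBoost's C++ implementation: the trailing moving-average smoothing, the function name `early_stop_iteration`, and the parameters `window` and `patience` are all assumptions made for this example. Section 4.1.3 of the documentation describes the actual option.

```python
def early_stop_iteration(test_errors, window=3, patience=5):
    """Return the 1-based iteration at which training would stop, or None.

    Hypothetical sketch: the test error is smoothed with a trailing moving
    average over `window` iterations, and training stops once the smoothed
    error has failed to improve for `patience` consecutive iterations.
    """
    best = float("inf")   # best smoothed test error seen so far
    since_best = 0        # iterations since the last improvement
    for t in range(len(test_errors)):
        start = max(0, t - window + 1)
        recent = test_errors[start:t + 1]
        smoothed = sum(recent) / len(recent)
        if smoothed < best:
            best = smoothed
            since_best = 0
        else:
            since_best += 1
            if since_best >= patience:
                return t + 1
    return None  # the error kept improving often enough; never stopped


if __name__ == "__main__":
    errors = [0.5, 0.4, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3]
    print(early_stop_iteration(errors, window=1, patience=3))
```

A larger `window` makes the rule less sensitive to noise in the test error, at the cost of reacting later to a genuine plateau.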

JMLR Waffles 2013-04-06

by mgashler - April 7, 2013, 02:04:10 CET [ Project Homepage BibTeX BibTeX for corresponding Paper Download ] 16194 views, 5328 downloads, 1 subscription

About: A broad collection of script-friendly command-line tools for machine learning and data mining tasks. (The command-line tools wrap functionality from a C++ class library.)

Changes:

See the change log at http://waffles.sourceforge.net/changelog.html


ADAMS 0.4.2

by fracpete - February 26, 2013, 03:26:25 CET [ Project Homepage BibTeX BibTeX for corresponding Paper Download ] 1797 views, 317 downloads, 1 subscription

About: The Advanced Data mining And Machine learning System (ADAMS) is a novel, flexible workflow engine aimed at quickly building and maintaining real-world, complex knowledge workflows.

Changes:
  • Added almost 20 more conversions and 20 new actors
  • R-Project integration using Rserve
  • WEKA webservice allows for programming language agnostic training, evaluation and use of WEKA models (classifiers, clusterers) and data processing using filters
  • Spreadsheets now come with basic formula support
  • Spreadsheets can be used for lookup tables in the flow
  • Support for "chunked" reading/writing of spreadsheets to process millions of rows

ELKI 0.5.5

by erich - December 14, 2012, 18:49:58 CET [ Project Homepage BibTeX BibTeX for corresponding Paper Download ] 4248 views, 759 downloads, 2 subscriptions

About: ELKI is a framework for implementing data-mining algorithms with support for index structures, that includes a wide variety of clustering and outlier detection methods.

Changes:

This is mostly a bug-fix release. Many small issues have been fixed; among other things, the fixes improve performance, make error reporting much better, and ease the use of sparse vectors and external precomputed distances.

This will be the last ELKI release to support Java 6. The next ELKI release will require Java 7.

Algorithms

  • Some new LOF variants (LDF, SimpleLOF, SimpleKernelDensityLOF)
  • Correlation Outlier Probabilities (ICDM 2012)
  • A naive mean-shift clustering
  • Single-link clustering (SLINK algorithm) should be significantly faster due to optimized data structures
  • "Benchmarking" algorithms for measuring the performance of index structures

Index layer

  • Bulk loading R-Trees should be faster - in particular Sort Tile Recursive can work very well.
  • M-Trees have been refactored and optimized for double distances

Database layer

  • Bundle format (work in progress): low-level binary format for fast data exchange
  • DBID and DataStore layer received some additional classes for further performance improvements
  • KNN heap structures were revisited. The code is less clean now, but performs better in benchmarks.

Visualizations

  • General clean up and API simplifications
  • Some additional modules and improvements

Various

  • There is a new parameter class, RandomParameter
  • Some new distributions were added, also to the data set generator.

Tutorials

  • The website has new tutorials, including one on a k-means variation that produces equal sized clusters.

PREA Personalized Recommendation Algorithms Toolkit 1.1

by srcw - September 1, 2012, 22:53:37 CET [ Project Homepage BibTeX Download ] 1941 views, 623 downloads, 2 subscriptions

About: Open-source Java software providing collaborative filtering algorithms.

Changes:

Initial Announcement on mloss.org.


JMLR Mulan 1.4.0

by lefman - August 1, 2012, 09:49:21 CET [ Project Homepage BibTeX BibTeX for corresponding Paper Download ] 9735 views, 4655 downloads, 1 subscription

About: Mulan is an open-source Java library for learning from multi-label datasets. Multi-label datasets consist of training examples of a target function that has multiple binary target variables. This means that each item of a multi-label dataset can be a member of multiple categories or annotated by many labels (classes). This is the nature of many real-world problems, such as semantic annotation of images and video, web page categorization, direct marketing, functional genomics, and music categorization into genres and emotions.

Changes:

Learners

  • BinaryRelevance.java: improved data handling that avoids copying the entire input space, leading to substantial speedups for large datasets and very large numbers of labels.
  • RAkEL.java: updated technical information, added a check for the case where the number of labels is less than or equal to the size of the subset.
  • MultiLabelKNN.java: now checks whether the number of instances is less than the number of requested nearest neighbors.
  • Addition of AdaBoostMH.java, an explicit implementation of AdaBoost.MH as a combination of AdaBoostM1 and IncludeLabelsClassifier.
  • Addition of MLPTO.java, the Multi Label Probabilistic Threshold Optimizer (MLPTO) thresholding technique.
  • Addition of ApproximateExampleBasedFMeasureOptimizer.java, an approximate method for the maximization of example-based F-measure.

Measures/Evaluation

  • Addition of Specificity measure (example-based, micro/macro label-based)
  • Addition of Mean Average Interpolated Precision (MAiP), Geometric Mean Average Precision (GMAP), Geometric Mean Average Interpolated Precision (GMAiP).
  • New methods for stratified multi-label evaluation.
  • Added support for outputting per label results for all measures that implement the MacroAverageMeasure interface.
  • Simplified the "strictness" issue of information retrieval measures by adopting specific assumptions (outlined in the new class InformationRetrievalMeasures.java) to handle special cases, instead of the less clear and useful solution of outputting NaN and the less realistic solution of ignoring special cases.
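
As an illustration of the micro- and macro-averaged label-based Specificity measure listed above, here is a minimal Python sketch. It is not Mulan's Java code. The function names and the `empty_value` parameter are hypothetical, and returning `empty_value` for a label with no negative examples is an assumption of this sketch, not necessarily the convention adopted in InformationRetrievalMeasures.java.

```python
def specificity(tn, fp, empty_value=1.0):
    """Specificity TN / (TN + FP).

    `empty_value` is returned when there are no negatives at all; this
    choice is an assumption of the sketch, not Mulan's documented rule.
    """
    return tn / (tn + fp) if (tn + fp) > 0 else empty_value


def label_based_specificity(y_true, y_pred):
    """Return (micro, macro) specificity for 0/1 label matrices.

    y_true, y_pred: lists of rows, one row per example and one entry per label.
    """
    n_labels = len(y_true[0])
    tn = [0] * n_labels  # true negatives per label
    fp = [0] * n_labels  # false positives per label
    for truth, pred in zip(y_true, y_pred):
        for j in range(n_labels):
            if truth[j] == 0:
                if pred[j] == 0:
                    tn[j] += 1
                else:
                    fp[j] += 1
    # Micro: pool the counts over all labels, then compute one ratio.
    micro = specificity(sum(tn), sum(fp))
    # Macro: compute the ratio per label, then average the ratios.
    macro = sum(specificity(tn[j], fp[j]) for j in range(n_labels)) / n_labels
    return micro, macro


if __name__ == "__main__":
    y_true = [[0, 0], [0, 1], [0, 1], [0, 1]]
    y_pred = [[0, 1], [0, 1], [1, 1], [0, 1]]
    print(label_based_specificity(y_true, y_pred))
```

Micro-averaging weights every negative label assignment equally, while macro-averaging weights every label equally, so a rare label with poor specificity pulls the macro value down more strongly.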

Bug fixes

  • Bug fix in LabelsBuilder.java.
  • Bug fix in Ranker.java.
  • Bug fix in ThresholdPrediction.java.
  • Fix for bug occurring when loading the XSD for mulan data outside the command-line environment (e.g. web applications).
  • Javadoc comment updates.

API changes

  • Upgrade to Java 1.6
  • Upgrade to JUnit 4.10
  • Upgrade to Weka 3.7.6.

Miscellaneous

  • Meaningful messages are now shown when a DataLoadException is thrown.
  • PT6(PT6Transformation.java): renamed to IncludeLabelsTransformation.java.
  • MultiLabelInstances now support serialization, as needed by the improved binary relevance transformation.
  • BinaryRelevanceAttributeEvaluator.java: updated according to latest BR improvements.

MLWizard 5.2

by remat - July 26, 2012, 15:04:14 CET [ Project Homepage BibTeX Download ] 1642 views, 355 downloads, 1 subscription

About: MLWizard recommends and optimizes classification algorithms based on meta-learning. It is a software wizard fully integrated into RapidMiner, but it can be used as a library as well.

Changes:

Faster parameter optimization using genetic algorithm with predefined start population.


WebEnsemble 1.0

by jungc005 - May 8, 2012, 22:24:44 CET [ BibTeX Download ] 877 views, 223 downloads, 1 subscription

About: Use the power of crowdsourcing to create ensembles.

Changes:

Initial Announcement on mloss.org.


MLFlex 02-21-2012-00-12

by srp33 - April 3, 2012, 16:44:43 CET [ Project Homepage BibTeX BibTeX for corresponding Paper Download ] 1065 views, 181 downloads, 1 subscription

About: Motivated by a need to classify high-dimensional, heterogeneous data from the bioinformatics domain, we developed ML-Flex, a machine-learning toolbox that enables users to perform two-class and multi-class classification analyses in a systematic yet flexible manner. ML-Flex was written in Java but is capable of interfacing with third-party packages written in other programming languages. It can handle multiple input-data formats and supports a variety of customizations. ML-Flex provides implementations of various validation strategies, which can be executed in parallel across multiple computing cores, processors, and nodes. Additionally, ML-Flex supports aggregating evidence across multiple algorithms and data sets via ensemble learning. (See http://jmlr.csail.mit.edu/papers/volume13/piccolo12a/piccolo12a.pdf.)

Changes:

Initial Announcement on mloss.org.


NaN toolbox 2.5.2

by schloegl - February 10, 2012, 11:45:52 CET [ Project Homepage BibTeX Download ] 21530 views, 4108 downloads, 1 subscription

About: NaN-toolbox is a statistics and machine learning toolbox for handling data with and without missing values.

Changes:

Changes in v.2.5.2:
  • Faster version of quantile if multiple quantiles are requested
  • Removes the dependency on ZLIB and thus fixes "pkg install nan" for Octave on Windows
  • A number of minor improvements

For details see the CHANGELOG at http://pub.ist.ac.at/~schloegl/matlab/NaN/CHANGELOG


mldata.org svn-r1070-Apr-2011

by sonne - April 8, 2011, 10:15:49 CET [ Project Homepage BibTeX Download ] 2870 views, 421 downloads, 1 subscription

About: The source code of the mldata.org site - a community portal for machine learning data sets.

Changes:

Initial Announcement on mloss.org.


mldata-utils 0.5.0

by sonne - April 8, 2011, 10:02:44 CET [ Project Homepage BibTeX Download ] 13917 views, 2567 downloads, 1 subscription

About: Tools to convert datasets between various formats, performance measures, and API functions to communicate with mldata.org.

Changes:
  • Changed the task file format so that data splits can have a variable number of items and be put into up to 256 categories of training/validation/test/not used/...
  • Various bugfixes.

Apache Mahout 0.4

by gsingers - November 2, 2010, 04:28:34 CET [ Project Homepage BibTeX Download ] 10452 views, 3421 downloads, 2 subscriptions

About: Apache Mahout is an Apache Software Foundation project with the goal of creating both a community of users and a scalable, Java-based framework consisting of many machine learning algorithm [...]

Changes:

We are pleased to announce release 0.4 of Mahout. Virtually every corner of the project has changed, and significantly, since 0.3. Developers are invited to use and depend on version 0.4 even as yet more change is to be expected before the next release. Highlights include:

* Model refactoring and CLI changes to improve integration and consistency
* New ClusterEvaluator and CDbwClusterEvaluator offer new ways to evaluate clustering effectiveness
* New Spectral Clustering and MinHash Clustering (still experimental)
* New VectorModelClassifier allows any set of clusters to be used for classification
* Map/Reduce job to compute the pairwise similarities of the rows of a matrix using a customizable similarity measure
* Map/Reduce job to compute the item-item-similarities for item-based collaborative filtering
* RecommenderJob has been evolved to a fully distributed item-based recommender
* Distributed Lanczos SVD implementation
* More support for distributed operations on very large matrices
* Easier access to Mahout operations via the command line
* New HMM based sequence classification from GSoC (currently as sequential version only and still experimental)
* Sequential logistic regression training framework
* New SGD classifier
* Experimental new type of NB classifier, and feature reduction options for existing one
* New vector encoding framework for high speed vectorization without a pre-built dictionary
* Additional elements of supervised model evaluation framework
* Promoted several pieces of old Colt framework to tested status (QR decomposition, in particular)
* Can now save random forests and use them to classify new data
* Many, many small fixes, improvements, refactorings and cleanup

Pyriel 1.5

by tfawcett - October 27, 2010, 09:12:53 CET [ BibTeX BibTeX for corresponding Paper Download ] 8555 views, 1592 downloads, 1 subscription

About: Pyriel is a Python system for learning classification rules from data. Unlike other rule learning systems, it is designed to learn rule lists that maximize the area under the ROC curve (AUC) instead of accuracy. Pyriel is mostly an experimental research tool, but it's robust and fast enough to be used for lightweight industrial data mining.

Changes:

Changes in 1.5:
  • Changed CF (confidence factor) to do Laplace smoothing of estimates.
  • New flag "--score-for-class C" causes scores to be computed relative to a given (positive) class, for two-class problems.
  • Fixed bug in example sampling code (--sample n).
  • Fixed bug keeping old-style example formats (terminated by dot) from working.
  • More code restructuring.
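
Laplace smoothing of a proportion estimate, as mentioned in the 1.5 changes, can be sketched in a few lines of Python. This shows only the standard add-one formula (k + 1) / (n + C). The function name `laplace_estimate` and its parameters are hypothetical, and how Pyriel actually applies the smoothing to its confidence factor is not specified here.

```python
def laplace_estimate(class_count, total_count, n_classes=2):
    """Laplace-smoothed estimate of a class proportion: (k + 1) / (n + C).

    class_count: examples of the class of interest covered by a rule (k)
    total_count: all examples covered by the rule (n)
    n_classes:   number of classes (C); 2 for a two-class problem
    """
    return (class_count + 1) / (total_count + n_classes)


if __name__ == "__main__":
    # A rule covering 3 examples, all positive: the raw estimate would be
    # 3/3, while the smoothed estimate is pulled toward 1/2.
    print(laplace_estimate(3, 3))
    # A rule covering no examples falls back to the uniform prior 1/C.
    print(laplace_estimate(0, 0))
```

The smoothing keeps rules with very small coverage from receiving extreme confidence values of exactly 0 or 1.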


pHMM4weka 1.0

by smm52 - October 22, 2010, 03:48:07 CET [ Project Homepage BibTeX Download ] 2417 views, 700 downloads, 1 subscription

About: This Java software implements Profile Hidden Markov Models (PHMMs) for protein classification for the WEKA workbench. Standard PHMMs and newly introduced binary PHMMs are used. In addition the software allows propositionalisation of PHMMs.

Changes:

Description changed.


JMLR MOA Massive Online Analysis June-09

by abifet - June 4, 2010, 14:05:31 CET [ Project Homepage BibTeX BibTeX for corresponding Paper Download ] 7824 views, 3251 downloads, 1 subscription

About: Massive Online Analysis (MOA) is a real-time analytic tool for data streams. It is a software environment for implementing algorithms and running experiments for online learning from evolving data streams. MOA includes a collection of offline and online methods as well as tools for evaluation. In particular, it implements boosting, bagging, and Hoeffding Trees, all with and without Naive Bayes classifiers at the leaves. MOA supports bi-directional interaction with WEKA, the Waikato Environment for Knowledge Analysis, and it is released under the GNU GPL license.

Changes:

Initial Announcement on mloss.org.


ELF Ensemble Learning Framework 0.1

by mjahrer - May 10, 2010, 23:54:53 CET [ Project Homepage BibTeX BibTeX for corresponding Paper Download ] 3853 views, 603 downloads, 1 subscription

About: ELF provides many well-implemented supervised learners for classification and regression tasks, with the option of ensemble learning.

Changes:

Initial Announcement on mloss.org.


Debellor 1.0

by mwojnars - July 30, 2009, 16:48:05 CET [ Project Homepage BibTeX BibTeX for corresponding Paper Download ] 5831 views, 1786 downloads, 1 subscription

About: Debellor is a scalable and extensible platform that provides a common architecture for data mining and machine learning algorithms of various types.

Changes:
  • Naming of numerous classes/methods/fields changed to be more accurate and comprehensible
  • Weka and Rseslib libraries updated to the newest versions: Weka 3.6.1 & Rseslib 3.0.1. Debellor's wrappers adapted
  • New class: CrossValidation - evaluator of trainable cells through cross-validation
  • New class: RMSE - calculation of Root Mean Squared Error score
  • Data objects can be compared and used in collections
  • ArffReader can read from a user-provided java.io.InputStream
  • More convenient use of parameters (setting values)
  • More convenient use of data objects and data types (construction, type casting)
  • Other minor improvements to existing classes
  • Javadoc extended
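
The Root Mean Squared Error score named in the changes above can be illustrated with a short Python sketch. It shows the standard definition only and is not Debellor's Java RMSE class. The function name `rmse` and its list-based interface are assumptions made for this example.

```python
import math


def rmse(targets, predictions):
    """Root Mean Squared Error: sqrt(mean((target - prediction)^2))."""
    if len(targets) != len(predictions) or not targets:
        raise ValueError("targets and predictions must be non-empty and equally long")
    squared_errors = [(t - p) ** 2 for t, p in zip(targets, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


if __name__ == "__main__":
    print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
    print(rmse([0.0, 0.0], [3.0, 4.0]))
```

In a cross-validation setting, a score like this would typically be computed on each held-out fold and then averaged over the folds.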