Projects supporting the ARFF data format.

python weka wrapper 0.3.3

by fracpete - September 26, 2015, 06:11:42 CET [ Project Homepage BibTeX Download ] 17206 views, 3690 downloads, 3 subscriptions

About: A thin Python wrapper that uses the javabridge Python library to communicate with a Java Virtual Machine executing Weka API calls.

  • updated to Weka 3.7.13
  • documentation now covers the API as well
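
A minimal usage sketch of the wrapper described above, following the pattern in the project's documentation (treat as illustrative; the ARFF path and classifier choice are placeholders):

    import weka.core.jvm as jvm
    from weka.core.converters import Loader
    from weka.classifiers import Classifier

    jvm.start()  # launches the JVM via javabridge

    # Load an ARFF file and mark the last attribute as the class attribute.
    loader = Loader(classname="weka.core.converters.ArffLoader")
    data = loader.load_file("iris.arff")  # placeholder path
    data.class_is_last()

    # Build a Weka classifier through the wrapper.
    cls = Classifier(classname="weka.classifiers.trees.J48")
    cls.build_classifier(data)
    print(cls)

    jvm.stop()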

KEEL Knowledge Extraction based on Evolutionary Learning 3.0

by keel - September 18, 2015, 12:38:54 CET [ Project Homepage BibTeX Download ] 405 views, 111 downloads, 1 subscription

About: KEEL (Knowledge Extraction based on Evolutionary Learning) is an open-source (GPLv3) Java software tool that can be used for a large number of different knowledge discovery tasks. KEEL provides a simple, data-flow-based GUI to design experiments with different datasets and computational intelligence algorithms (paying special attention to evolutionary algorithms) in order to assess the behavior of the algorithms. It contains a wide variety of classical knowledge extraction algorithms, preprocessing techniques (training set selection, feature selection, discretization, imputation methods for missing values, among others), computational intelligence based learning algorithms, hybrid models, and statistical methodologies for contrasting experiments. It allows a complete analysis of new computational intelligence proposals in comparison with existing ones. Moreover, KEEL has been designed with a two-fold goal: research and education. KEEL is also coupled with KEEL-dataset, a webpage that aims to provide machine learning researchers with a set of benchmarks to analyze the behavior of learning methods. Concretely, it offers benchmarks already formatted in the KEEL format for classification (standard, multi-instance, and imbalanced data), semi-supervised classification, regression, time series, and unsupervised learning. A set of low-quality data benchmarks is also maintained in the repository.


Initial Announcement on

ELKI 0.7.0-20150828

by erich - September 17, 2015, 10:20:30 CET [ Project Homepage BibTeX BibTeX for corresponding Paper Download ] 14695 views, 2698 downloads, 4 subscriptions

About: ELKI is a framework for implementing data-mining algorithms with support for index structures; it includes a wide variety of clustering and outlier detection methods.


Additions and Improvements from ELKI 0.6.0:

  • Uncertain data types, and clustering algorithms for uncertain data.

  • Major refactoring of distances: removal of Distance values and of support for non-double-valued distance functions. While this reduces the generality of ELKI, it allowed removing about 2.5% of the codebase by no longer having to maintain optimized code paths for double distances. Generics for distances were present in almost every distance-based algorithm, and we were also happy to reduce the use of generics this way. Support for non-double-valued distances can trivially be added again, e.g. by adding the specialization one level higher: at the query level instead of the distance level.

  • In this process, we also removed the generics from NumberVector. The object-based get was deprecated for a good reason long ago, and methods such as doubleValue are more efficient (even for non-DoubleVectors).

  • Dropped some long-deprecated classes

Clustering algorithms:


  • speedups for some initialization heuristics
  • K-means++ initialization no longer squares distances (again)
  • farthest-point heuristic now uses the minimum instead of the sum (and was renamed)
  • additional evaluation criteria
  • Elkan's and Hamerly's faster k-means variants

CLARA clustering


Hierarchical clustering

  • Renamed naive algorithm to AGNES
  • Anderberg's algorithm (faster than AGNES, slower than SLINK)
  • CLINK for complete linkage clustering in O(n²) time, O(n) memory
  • Simple extraction from HDBSCAN
  • "Optimal" extraction from HDBSCAN
  • HDBSCAN, in two variants

LSDBC clustering

EM clustering was refactored and moved into its own package. The new version is much more extensible.

Parallel computation framework, and some parallelized algorithms

  • Parallel k-means
  • Parallel LOF and variants


  • LibSVM format parser


  • kNN classification (with index acceleration)

Evaluation: Internal cluster evaluation:

  • Silhouette index (the standard definition is sketched after this list)
  • Simplified Silhouette index (faster)
  • Davies-Bouldin index
  • PBM index
  • Variance-Ratio Criterion
  • Sum of squared errors
  • C-Index
  • Concordant pair indexes (Gamma, Tau)
  • Different noise handling strategies for internal indexes
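
For reference, a sketch of the standard Silhouette definition underlying the first two indexes above (notation mine, not from the release notes): for an object i, let a(i) be the mean distance to the other members of its own cluster and b(i) the smallest mean distance to the members of any other cluster; then

    s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}} \in [-1, 1]

The simplified variant is usually defined by replacing the all-pairs means with distances to the cluster centroids, which is what makes it faster.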

Statistical dependence measures:

  • Distance correlation dCor.
  • Hoeffding's D.
  • Some divergence / mutual information measures.

Distance functions:

  • Big refactoring.
  • Time series distances refactored; they now allow variable-length series.
  • Hellinger distance and kernel function.


  • Faster MDS implementation using power iterations.

Indexing improvements:

  • Precomputed distance matrix "index".
  • iDistance index (static only).
  • Inverted-list index for sparse data and cosine/arccosine distance.
  • Cover tree index (static only).

Frequent Itemset Mining:

  • Improved APRIORI implementation.
  • FP-Growth added.
  • Eclat (basic version only) added.

Uncertain clustering:

  • Discrete and continuous data models
  • FDBSCAN clustering
  • UKMeans clustering
  • CKMeans clustering
  • Representative Uncertain Clustering (Meta-algorithm)
  • Center-of-mass meta Clustering (allows using other clustering algorithms on uncertain objects) (KDD'14)

Outlier detection changes / smaller improvements:

  • KDEOS outlier detection (SDM14)
  • k-means-based outlier detection (distance to the centroid) and a Silhouette-coefficient-based approach (which does not work too well on the toy data sets: the lowest silhouette values are usually found where two clusters touch).
  • Bug fix in kNN weight when distances are tied and kNN yields more than k results.
  • kNN and kNN-weight outlier detection have their k parameter changed: the old 2NN outlier is now the 1NN outlier, as commonly understood in the classification literature (1 nearest neighbor ''other than the query object''; in the database literature, the 1NN usually is the query object itself). You can easily get the old result back by decreasing k by one.
  • The LOCI implementation is now only O(n^3 log n) instead of O(n^4).


  • MiniGUI has two "secret" new options, -minigui.last and -minigui.autorun, to load the last saved configuration and run it, for convenience.

  • The logging API has been extended to make logging more convenient in a number of places (saving some lines for progress logging and timing).

PyScriptClassifier 0.0.1

by cjb60 - August 15, 2015, 05:14:59 CET [ Project Homepage BibTeX Download ] 526 views, 149 downloads, 1 subscription

About: Easily prototype WEKA classifiers using Python scripts.


Initial Announcement on

NaN toolbox 2.8.1

by schloegl - July 6, 2015, 22:43:23 CET [ Project Homepage BibTeX Download ] 37829 views, 7882 downloads, 3 subscriptions

About: NaN-toolbox is a statistics and machine learning toolbox for handling data with and without missing values.


Changes in v.2.8.1:

  • a number of bug fixes
  • compatibility issues with recent versions of Octave are addressed
  • upgrade to libsvm-3.12

For details see the CHANGELOG at

ADAMS 0.4.10

by fracpete - June 22, 2015, 23:14:58 CET [ Project Homepage BibTeX BibTeX for corresponding Paper Download ] 14040 views, 2859 downloads, 3 subscriptions

About: The Advanced Data mining And Machine learning System (ADAMS) is a novel, flexible workflow engine aimed at quickly building and maintaining real-world, complex knowledge workflows.

  • fixes a glitch in the debugging functionality when using the Breakpoint control actor

JMLR Mulan 1.5.0

by lefman - February 23, 2015, 21:19:05 CET [ Project Homepage BibTeX BibTeX for corresponding Paper Download ] 19254 views, 7044 downloads, 2 subscriptions

About: Mulan is an open-source Java library for learning from multi-label datasets. Multi-label datasets consist of training examples of a target function that has multiple binary target variables. This means that each item of a multi-label dataset can be a member of multiple categories or annotated by many labels (classes). This is actually the nature of many real-world problems such as semantic annotation of images and video, web page categorization, direct marketing, functional genomics, and music categorization into genres and emotions.
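
To make the "multiple binary target variables" view concrete, here is a minimal plain-Python sketch (illustrative only: this is not Mulan's Java API, and the feature values and labels are invented):

    # Each example carries a *set* of labels, which induces one binary
    # target variable per label in the label space.
    examples = [
        # (feature vector,    labels assigned to this example)
        ([0.3, 1.2, 0.0],     {"beach", "sunset"}),
        ([1.1, 0.4, 2.5],     {"urban"}),
        ([0.9, 0.8, 0.1],     {"beach", "urban", "people"}),
    ]

    label_space = sorted(set().union(*(labels for _, labels in examples)))
    for features, labels in examples:
        binary_targets = [int(label in labels) for label in label_space]
        print(features, binary_targets)  # e.g. [0.3, 1.2, 0.0] [1, 0, 1, 0]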



  • Added the MLCSSP algorithm (from ICML 2013)
  • Enhancements of multi-target regression capabilities
  • Improved CLUS support
  • Added pairwise classifier and pairwise transformation


  • Providing training data in the Evaluator is unnecessary in the case of specific measures.
  • Examples with missing ground truth are not skipped for measures that handle missing values.
  • Added logistic and squared error losses and measures

Bug fixes

  • IndexOutOfBounds in calculation of MiAP and GMiAP
  • Bug fix in
  • When in rank/score mode, the meta-data contained additional unnecessary attributes. (Newton Spolaor)

API changes

  • Upgrade to Java 7
  • Upgrade to Weka 3.7.10


  • Small changes and improvements in the wrapper classes for the CLUS library
  • (new experiment)
  • Enumeration is now used for specifying the type of meta-data. (Newton Spolaor)

Hub Miner 1.1

by nenadtomasev - January 22, 2015, 16:33:51 CET [ Project Homepage BibTeX BibTeX for corresponding Paper Download ] 2272 views, 483 downloads, 2 subscriptions

About: Hubness-aware Machine Learning for High-dimensional Data

  • BibTeX support for all algorithm implementations, making all of them easy to reference (via the algref package).

  • Two more hubness-aware approaches (meta-metric-learning and feature construction)

  • An implementation of Hit-Miss networks for analysis.

  • Several minor bug fixes.

  • The following instance selection methods were added: HMScore, Carving, Iterative Case Filtering, ENRBF.

  • The following clustering quality indexes were added: Fowlkes-Mallows, Calinski-Harabasz, PBM, G+, Tau, Point-Biserial, Hubert's statistic, McClain-Rao, C-root-k.

  • Some more experimental scripts have been included.

  • Extensions in the estimation of hubness risk.

  • Alias and weighted reservoir methods for weight-proportional random selection.

JEMLA 1.0

by bathaeian - January 4, 2015, 08:34:49 CET [ Project Homepage BibTeX Download ] 962 views, 313 downloads, 3 subscriptions

About: Java package for calculating Entropy for Machine Learning Applications. It implements several methods of handling missing values, so it can also be used as a lab for examining missing values.
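
For context, the quantity computed is the Shannon entropy H(X) = -sum_x p(x) log2 p(x). A generic sketch in Python (illustrative only, not JEMLA's Java API; missing values are simply skipped here, whereas JEMLA implements several dedicated strategies):

    from collections import Counter
    from math import log2

    def entropy(labels):
        # Shannon entropy in bits; None marks a missing value and is skipped.
        observed = [x for x in labels if x is not None]
        counts = Counter(observed)
        n = len(observed)
        return -sum((c / n) * log2(c / n) for c in counts.values())

    print(entropy(["yes", "yes", "no", None, "no", "yes"]))  # ~0.971 bits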


Discretization of numerical values has been added for calculating the mode of values and the fractional replacement of missing ones. The class diagram is available on the web.

JMLR JKernelMachines 2.5

by dpicard - December 11, 2014, 17:51:42 CET [ Project Homepage BibTeX BibTeX for corresponding Paper Download ] 20826 views, 4847 downloads, 4 subscriptions

Rating: 4.5/5 (based on 4 votes)

About: A machine learning library in Java for easy development of new kernels.


Version 2.5

  • New active learning algorithms
  • Better threading management
  • New multiclass SVM algorithm based on SDCA
  • Handle class balancing in cross-validation
  • Optional EJML support switched to version 0.26
  • Various bugfixes and improvements

pySPACE 1.2

by krell84 - October 29, 2014, 15:36:28 CET [ Project Homepage BibTeX BibTeX for corresponding Paper Download ] 3775 views, 813 downloads, 1 subscription

About: pySPACE is the abbreviation for "Signal Processing and Classification Environment in Python using YAML and supporting parallelization". It is modular software for processing large data streams that has been specifically designed to enable distributed execution and empirical evaluation of signal processing chains. Various signal processing algorithms (so-called nodes) are available within the software, ranging from finite impulse response filters through data-dependent spatial filters (e.g. CSP, xDAWN) to established classifiers (e.g. SVM, LDA). pySPACE incorporates the concept of nodes and node chains of the MDP framework. Due to its modular architecture, the software can easily be extended with new processing nodes and more general operations. Large-scale empirical investigations can be configured using simple text configuration files in the YAML format, executed on different (distributed) computing modalities, and evaluated using an interactive graphical user interface.


Improved testing, improved documentation, Windows compatibility, more algorithms.

JMLR Waffles 2014-07-05

by mgashler - July 20, 2014, 04:53:54 CET [ Project Homepage BibTeX BibTeX for corresponding Paper Download ] 30840 views, 8475 downloads, 2 subscriptions

About: Script-friendly command-line tools for machine learning and data mining tasks. (The command-line tools wrap functionality from a public domain C++ class library.)


Added support for CUDA GPU-parallelized neural network layers, and several other new features. Full list of changes at

JMLR MOA Massive Online Analysis Nov-13

by abifet - April 4, 2014, 03:50:20 CET [ Project Homepage BibTeX BibTeX for corresponding Paper Download ] 14562 views, 5563 downloads, 1 subscription

About: Massive Online Analysis (MOA) is a real-time analytics tool for data streams. It is a software environment for implementing algorithms and running experiments for online learning from evolving data streams. MOA includes a collection of offline and online methods as well as tools for evaluation. In particular, it implements boosting, bagging, and Hoeffding Trees, all with and without Naive Bayes classifiers at the leaves. MOA supports bi-directional interaction with WEKA, the Waikato Environment for Knowledge Analysis, and is released under the GNU GPL license.


New version November 2013

SAMOA 0.0.1

by gdfm - April 2, 2014, 17:09:08 CET [ Project Homepage BibTeX Download ] 1361 views, 404 downloads, 1 subscription

About: SAMOA is a platform for mining big data streams. It is a distributed streaming machine learning (ML) framework that contains a programming abstraction for distributed streaming ML algorithms.


Initial Announcement on

JMLR MultiBoost 1.2.02

by busarobi - March 31, 2014, 16:13:04 CET [ Project Homepage BibTeX BibTeX for corresponding Paper Download ] 30965 views, 5261 downloads, 1 subscription

About: MultiBoost is a multi-purpose boosting package implemented in C++. It is based on the multi-class/multi-task AdaBoost.MH algorithm [Schapire-Singer, 1999]. Basic base learners (stumps, trees, products, Haar filters for image processing) can be easily complemented by new data representations and the corresponding base learners, without interfering with the main boosting engine.
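
As a reminder (standard AdaBoost notation, not taken from the MultiBoost documentation), the binary AdaBoost weight update that AdaBoost.MH extends to the multi-class/multi-label setting is

    \alpha_t = \tfrac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t}, \qquad
    w_{t+1}(i) = \frac{w_t(i)\, \exp(-\alpha_t\, y_i\, h_t(x_i))}{Z_t}

where \epsilon_t is the weighted error of the base learner h_t, the labels and predictions y_i, h_t(x_i) are in {-1, +1}, and Z_t normalizes the weights; AdaBoost.MH maintains such a weight for every (example, label) pair.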


Major changes:

  • The “early stopping” feature can now be based on any metric output with the --outputinfo command line argument.

  • Early stopping now works with the --slowresume command line argument.

Minor fixes:

  • More informative output when testing.

  • Various compilation glitches with recent clang (OS X/Linux) fixed.

Chordalysis 1.0

by fpetitjean - March 24, 2014, 01:22:06 CET [ Project Homepage BibTeX BibTeX for corresponding Paper Download ] 1909 views, 449 downloads, 1 subscription

About: Log-linear analysis for high-dimensional data


Initial Announcement on

CIlib Computational Intelligence Library 0.8

by gpampara - August 22, 2013, 08:34:21 CET [ Project Homepage BibTeX BibTeX for corresponding Paper Download ] 2238 views, 656 downloads, 1 subscription

About: CIlib is a library of computational intelligence algorithms and supporting components that allows simple extension and experimentation. The library is peer reviewed and is backed by a leading research group in the field. The library is under active development.


Initial Announcement on

Apache Mahout 0.8

by gsingers - July 27, 2013, 15:52:32 CET [ Project Homepage BibTeX Download ] 17885 views, 4771 downloads, 2 subscriptions

About: Apache Mahout is an Apache Software Foundation project with the goal of creating both a community of users and a scalable, Java-based framework consisting of many machine learning algorithm [...]


Apache Mahout 0.8 contains, amongst a variety of performance improvements and bug fixes, an implementation of Streaming K-Means, deeper Lucene/Solr integration and new scalable recommender algorithms. For a full description of the newest release, see

PREA Personalized Recommendation Algorithms Toolkit 1.1

by srcw - September 1, 2012, 22:53:37 CET [ Project Homepage BibTeX Download ] 10058 views, 2587 downloads, 2 subscriptions

About: An open-source Java software package providing collaborative filtering algorithms.


Initial Announcement on

MLWizard 5.2

by remat - July 26, 2012, 15:04:14 CET [ Project Homepage BibTeX Download ] 3896 views, 965 downloads, 1 subscription

About: MLWizard recommends and optimizes classification algorithms based on meta-learning; it is a software wizard fully integrated into RapidMiner but can also be used as a library.


Faster parameter optimization using a genetic algorithm with a predefined start population.
