
 Description:
ELKI ("Environment for Developing KDD-Applications Supported by Index-Structures") is a development framework for data mining algorithms, written in Java. It includes a large variety of popular data mining algorithms, distance functions and index structures.
Its focus is particularly on clustering and outlier detection methods, in contrast to many other data mining toolkits that focus on classification. Additionally, it includes support for index structures such as the R*-tree and M-tree to improve algorithm performance.
The modular architecture is meant to allow adding custom components such as distance functions or algorithms, while being able to reuse the other parts for evaluation.
This package also includes the source code, since this software is meant for the rapid development of such algorithms, not so much for end users.
 Changes to previous version:
Additions and Improvements from ELKI 0.5.5:
Algorithms
Clustering:
 Hierarchical Clustering: the slower naive variants were added, and the code was refactored
 Partition extraction from hierarchical clusterings: different linkage strategies (e.g. Ward)
 Canopy pre-clustering
 Naive Mean-Shift Clustering
 Affinity propagation clustering (both with distances and similarities / kernel functions)
 K-means variations: best-of-multiple-runs, bisecting k-means
 New k-means initialization: farthest points, sample initialization
 Cheng and Church Biclustering
 P3C Subspace Clustering
 One-dimensional clustering algorithm based on kernel density estimation
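To illustrate the "farthest points" k-means initialization named in the list above, here is a minimal sketch. It is not ELKI's implementation: the class name FarthestPointSeeding, the method chooseSeeds and the choice to pass the first seed explicitly are assumptions made for this example. Each further seed is the point whose distance to its nearest already chosen seed is largest.

```java
import java.util.Arrays;

/** Farthest-point seeding sketch (hypothetical helper, not ELKI code). */
final class FarthestPointSeeding {
  /** Squared Euclidean distance between two points of equal dimension. */
  static double squaredDistance(double[] a, double[] b) {
    double sum = 0.0;
    for (int d = 0; d < a.length; d++) {
      double diff = a[d] - b[d];
      sum += diff * diff;
    }
    return sum;
  }

  /**
   * Returns the indices of k seed points, assuming 1 <= k <= data.length.
   * "first" is the index of the starting seed (an assumption of this sketch;
   * a real implementation might pick it at random).
   */
  static int[] chooseSeeds(double[][] data, int k, int first) {
    int n = data.length;
    int[] seeds = new int[k];
    // minDist[i] = squared distance from point i to its nearest chosen seed
    double[] minDist = new double[n];
    Arrays.fill(minDist, Double.POSITIVE_INFINITY);
    seeds[0] = first;
    for (int s = 1; s < k; s++) {
      int last = seeds[s - 1];
      int best = -1;
      double bestDist = -1.0;
      for (int i = 0; i < n; i++) {
        // Update with the most recently added seed, then track the maximum.
        minDist[i] = Math.min(minDist[i], squaredDistance(data[i], data[last]));
        if (minDist[i] > bestDist) {
          bestDist = minDist[i];
          best = i;
        }
      }
      seeds[s] = best;
    }
    return seeds;
  }
}
```

Farthest-point seeding spreads the initial means well, but it tends to select outliers, so it is usually worth comparing against sampled or random initialization.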
Outlier detection
 COP: correlation outlier probabilities
 LDF: a kernel density based LOF variant
 Simplified LOF: a simpler version of LOF (not using reachability distance)
 Simple Kernel Density LOF: a simple LOF using kernel density (more consistent than LDF)
 Simple outlier ensemble algorithm
 PINN: projection indexed nearest neighbors, via projected indexes.
 ODIN: kNN graph based outlier detection
 DWOF: Dynamic-Window Outlier Factor (contributed by Omar Yousry)
 ABOD refactored into ABOD, FastABOD and LBABOD
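The idea behind a simplified LOF (a LOF without reachability distances) can be sketched as follows. This is a brute-force illustration under stated assumptions, not ELKI's code: the class SimplifiedLofSketch and its methods are hypothetical, the density of a point is taken to be the inverse of its mean distance to its k nearest neighbors, and the score is the mean neighbor density divided by the point's own density.

```java
import java.util.Arrays;

/** Brute-force sketch of a simplified LOF (hypothetical class, not ELKI code). */
final class SimplifiedLofSketch {
  /** Euclidean distance between two points of equal dimension. */
  static double dist(double[] a, double[] b) {
    double sum = 0.0;
    for (int d = 0; d < a.length; d++) {
      double diff = a[d] - b[d];
      sum += diff * diff;
    }
    return Math.sqrt(sum);
  }

  /** Indices of the k nearest neighbors of point i, excluding i itself. */
  static int[] knn(double[][] data, int i, int k) {
    Integer[] idx = new Integer[data.length - 1];
    for (int j = 0, p = 0; j < data.length; j++) {
      if (j != i) {
        idx[p++] = j;
      }
    }
    Arrays.sort(idx, (a, b) ->
        Double.compare(dist(data[i], data[a]), dist(data[i], data[b])));
    int[] result = new int[k];
    for (int j = 0; j < k; j++) {
      result[j] = idx[j];
    }
    return result;
  }

  /** Outlier scores; values well above 1 indicate outliers. Assumes k < data.length. */
  static double[] scores(double[][] data, int k) {
    int n = data.length;
    int[][] neighbors = new int[n][];
    double[] density = new double[n];
    for (int i = 0; i < n; i++) {
      neighbors[i] = knn(data, i, k);
      double sum = 0.0;
      for (int j : neighbors[i]) {
        sum += dist(data[i], data[j]); // plain distance, no reachability distance
      }
      // Duplicate points give sum == 0; this sketch does not treat that case specially.
      density[i] = sum > 0 ? k / sum : Double.POSITIVE_INFINITY;
    }
    double[] lof = new double[n];
    for (int i = 0; i < n; i++) {
      double sum = 0.0;
      for (int j : neighbors[i]) {
        sum += density[j];
      }
      lof[i] = (sum / k) / density[i];
    }
    return lof;
  }
}
```

For example, SimplifiedLofSketch.scores(data, 2) on the one-dimensional points 0, 1, 2, 3 and 20 gives the last point a score far above 1, while the points in the dense group stay close to 1.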
Distances
 Geodetic distances now support different world models (WGS84 etc.) and are substantially faster.
 Levenshtein distances for processing strings, e.g. for analyzing phonemes (contributed code, see "Word segmentation through cross-lingual word-to-phoneme alignment", SLT 2013, Stahlberg et al.)
 Bray-Curtis, Clark, Kulczynski1 and Lorentzian distances with R-tree indexing support
 Histogram matching distances
 Probabilistic divergence distances (Jeffrey, Jensen-Shannon, Chi2, Kullback-Leibler)
 Kulczynski2 similarity
 Kernel similarity code has been refactored, and additional kernel functions have been added
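As a reminder of what the Levenshtein distance listed above computes, here is the textbook dynamic-programming formulation with unit costs for insertion, deletion and substitution. It is a generic sketch, not the contributed ELKI code, and the class name LevenshteinSketch is made up for this example.

```java
/** Textbook Levenshtein distance (hypothetical class, not ELKI code). */
final class LevenshteinSketch {
  /** Minimum number of single-character insertions, deletions and substitutions. */
  static int distance(String a, String b) {
    // Only two rows of the DP table are kept in memory.
    int[] prev = new int[b.length() + 1];
    int[] curr = new int[b.length() + 1];
    for (int j = 0; j <= b.length(); j++) {
      prev[j] = j; // transforming the empty prefix of a into b[0..j)
    }
    for (int i = 1; i <= a.length(); i++) {
      curr[0] = i; // deleting the first i characters of a
      for (int j = 1; j <= b.length(); j++) {
        int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
        curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
      }
      int[] tmp = prev;
      prev = curr;
      curr = tmp;
    }
    return prev[b.length()];
  }
}
```

Calling LevenshteinSketch.distance("kitten", "sitting") returns 3 (two substitutions and one insertion).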
Database Layer and Data Types
 Projection layer
 Parser for simple textual data (for use with Levenshtein distance)
 Various random projection families (including Feature Bagging, Achlioptas, and p-stable)
 Latitude+Longitude to ECEF
 Sparse vector improvements and bug fixes
 New filter: remove NaN values and missing values
 New filter: add histogram-based jitter
 New filter: normalize using statistical distributions
 New filter: robust standardization using Median and MAD
 New filter: Linear discriminant analysis (LDA)
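Robust standardization with median and MAD (median absolute deviation), as named in the filter list above, can be sketched for a single attribute as follows. The class MadStandardization is hypothetical, and the scaling constant 1.4826 (which makes the MAD comparable to a standard deviation for normally distributed data) is an assumption of this sketch; the filter shipped with ELKI may scale differently.

```java
import java.util.Arrays;

/** Median/MAD standardization of one attribute (hypothetical class, not ELKI code). */
final class MadStandardization {
  /** Assumed consistency factor relating MAD to the standard deviation. */
  static final double CONSISTENCY = 1.4826;

  /** Median of the values; the input array is not modified. */
  static double median(double[] values) {
    double[] sorted = values.clone();
    Arrays.sort(sorted);
    int n = sorted.length;
    return n % 2 == 1 ? sorted[n / 2] : 0.5 * (sorted[n / 2 - 1] + sorted[n / 2]);
  }

  /** Returns (x - median) / (CONSISTENCY * MAD) for every value of the column. */
  static double[] standardize(double[] column) {
    double med = median(column);
    double[] deviations = new double[column.length];
    for (int i = 0; i < column.length; i++) {
      deviations[i] = Math.abs(column[i] - med);
    }
    double scale = CONSISTENCY * median(deviations);
    double[] out = new double[column.length];
    for (int i = 0; i < column.length; i++) {
      // If more than half of the values are identical, MAD is 0; map everything to 0 then.
      out[i] = scale > 0 ? (column[i] - med) / scale : 0.0;
    }
    return out;
  }
}
```

Unlike mean and standard deviation, the median and MAD are barely influenced by a single extreme value, so an outlier such as 100 in the column 1, 2, 3, 4, 100 keeps a large standardized score instead of inflating the scale.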
Index Layer
 Another speed up in R-trees
 Refactoring of M- and R-trees: support for different strategies in M-tree; new strategies for M-tree splits; speedups in M-tree
 New index structure: in-memory k-d-tree
 New index structure: in-memory Locality Sensitive Hashing (LSH)
 New index structure: approximate projected indexes, such as PINN
 Index support for geodetic data (Details: Geodetic Distance Queries on R-Trees for Indexing Geographic Data, SSTD13)
 Sampled k nearest neighbors: reference KDD13 "Subsampling for Efficient and Effective Unsupervised Outlier Detection Ensembles"
 Cached (precomputed) k-nearest neighbors to share across multiple runs
 Benchmarking "algorithms" for indexes
Mathematics and Statistics
 Many new distributions have been added; 28 different distributions are now supported
 Additional estimation methods (using advanced statistics such as L-Moments); 44 estimators are now available
 Trimming and Winsorizing
 Automatic best-fit distribution estimation
 Preprocessor using these distributions for rescaling data sets
 API changes related to the new distributions support
 More kernel density functions
 RANSAC covariance matrix builder (unfortunately rather slow)
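Winsorizing, mentioned in the list above, replaces the most extreme values by the nearest retained value instead of dropping them (which is what trimming does). The following is a minimal sketch under stated assumptions: the class WinsorizeSketch is hypothetical, and the cut-off is simply the floor of fraction times n values on each side, which is one of several common conventions.

```java
import java.util.Arrays;

/** Symmetric Winsorizing of a sample (hypothetical class, not ELKI code). */
final class WinsorizeSketch {
  /**
   * Clamps the lowest and highest floor(fraction * n) values to the nearest
   * remaining value. Assumes 0 <= fraction < 0.5 and a non-empty sample.
   */
  static double[] winsorize(double[] sample, double fraction) {
    double[] sorted = sample.clone();
    Arrays.sort(sorted);
    int n = sorted.length;
    int cut = (int) Math.floor(fraction * n);
    double low = sorted[cut];          // smallest value that is kept as-is
    double high = sorted[n - 1 - cut]; // largest value that is kept as-is
    double[] out = new double[n];
    for (int i = 0; i < n; i++) {
      out[i] = Math.max(low, Math.min(high, sample[i]));
    }
    return out;
  }
}
```

With fraction 0.1 and the ten values 1, 2, ..., 9, 1000, the smallest value is raised to 2 and the largest is lowered to 9, so estimates computed afterwards are far less sensitive to the extreme observation.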
Visualization
 3D projected coordinates (Details: Interactive Data Mining with 3D-Parallel-Coordinate-Trees, SIGMOD 2013)
 Convex hulls now also include nested hierarchical clusters
Other
 Parser speedups
 Sparse vector bug fixes and improvements
 Various bug fixes
 PCA, MDS and LDA filters
 Text output was slightly improved (but still needs to be redesigned from scratch; please contribute!)
 Refactoring of hierarchy classes
 New heap classes and infrastructure enhancements
 Classes can have aliases, e.g. "l2" for Euclidean distance.
 Some error messages were made more informative.
 Benchmarking classes, also for approximate nearest neighbor search.
 Supported Operating Systems: Platform Independent
 Data Formats: Arff, Other, Csv, Parser Extension Api
 Tags: Clustering, Visualization, Algorithms, Evaluation, Anomaly Detection, Outlier Detection, Index Structures
Other available revisions

Version Changelog Date
0.6.0 Additions and Improvements from ELKI 0.5.5: identical to the list under "Changes to previous version" above.
January 10, 2014, 18:32:28
0.6.0beta1 New beta release, including some new algorithms (ODIN, PINN, full O(n^3) Hierarchical Clustering, new cluster extraction methods from hierarchies), new index structures (in-memory k-d-tree, LSH, projected indexes, PINN), new visualizations and much more.
This release requires Java 7; the new visualizations also need JOGL.
June 23, 2013, 21:28:33
0.5.5 This is mostly a bug fix release. Many small issues have been fixed, which improves performance, makes error reporting much better, and eases the use of sparse vectors and external precomputed distances, for example.
This will be the last ELKI release to support Java 6. The next ELKI release will require Java 7.
Algorithms
 Some new LOF variants (LDF, SimpleLOF, SimpleKernelDensityLOF)
 Correlation Outlier Probabilities (ICDM 2012)
 A naive mean-shift clustering
 Single-link clustering (SLINK algorithm) should be significantly faster due to optimized data structures
 "Benchmarking" algorithms for measuring the performance of index structures
Index layer
 Bulk loading R-trees should be faster; in particular, Sort-Tile-Recursive can work very well.
 M-trees have been refactored and optimized for double distances
Database layer
 Bundle format (work in progress): low-level binary format for fast data exchange
 DBID and DataStore layer received some additional classes for further performance improvements
 KNN heap structures were revisited. The code is less clean now, but performs better in benchmarks.
Visualizations
 General clean up and API simplifications
 Some additional modules and improvements
Various
 There is a new parameter class, RandomParameter
 Some new distributions were added, also to the data set generator.
Tutorials
 The website has new tutorials, including one on a k-means variation that produces equal-sized clusters.
December 14, 2012, 18:49:58
0.5.0 Primary release goals:
Cluster evaluation: metrics and circle-segment visualization (ICDE 2012)
Outlier detection ensembles (SDM 2011, 2012)
Usability improvements, for example by adding an automatic evaluation helper
Performance improvements by reducing boxing of primitive types
Parallel coordinates visualizations added for high-dimensional data
Tons of new algorithms, distance functions, index structures, visualizations, evaluators, ...
http://elki.dbs.ifi.lmu.de/wiki/Releases/ReleaseNotes0.5.0
July 1, 2012, 20:58:25
0.5.0 beta2 The full changelog is not yet up. Here is an excerpt of the new functions in 0.5.0:
 Further speed improvements
 R-tree flexibility: multiple new split strategies, bulk loaders, insertion strategies, so that ELKI can now do many R-tree variations, including the original Guttman R-tree, not only the R*-tree.
 K-means flexibility: MacQueen and Lloyd style iterations along with various seeding strategies, including k-means++
 VA-file (static only, not dynamic databases)
 Many popular cluster evaluation measures
 Alpha shapes, Voronoi cells, Delaunay triangulations in the visualization layer (in the projected space, so 2D!)
 Parallel coordinates
 Outlier ensemble code, presented at SDM 2012
 Some new algorithms, such as OUTRES
For the final 0.5.0 release we hope to have some approximate outlier detection methods for you (aLOCI, HilOut) as well as some subspace outlier detection methods including HiCS (ICDE 2012, to be presented tomorrow).
June 1, 2012, 21:32:08
0.5.0 beta1 The full changelog is not yet up. Here is an excerpt of the new functions in 0.5.0:
 Further speed improvements
 R-tree flexibility: multiple new split strategies, bulk loaders, insertion strategies, so that ELKI can now do many R-tree variations, including the original Guttman R-tree, not only the R*-tree.
 K-means flexibility: MacQueen and Lloyd style iterations along with various seeding strategies, including k-means++
 VA-file (static only, not dynamic databases); partial-VA to come for 0.5.0 final?
 Many popular cluster evaluation measures
 Alpha shapes, Voronoi cells, Delaunay triangulations in the visualization layer (in the projected space, so 2D!)
 Parallel coordinates (only halfway reviewed in beta1, more to come!)
 Outlier ensemble code, to be presented at SDM 2012 at the end of April
For the final 0.5.0 release we hope to have some approximate outlier detection methods for you (aLOCI, HilOut) as well as some subspace outlier detection methods including HiCS (ICDE 2012, to be presented tomorrow).
May 9, 2012, 20:46:08
0.4.1 Bug fix release for a number of minor issues affecting single algorithms that have accumulated over the previous months. Existing applications should not be affected by this upgrade.
A larger 0.5.0 release is scheduled for early April with new algorithms, but also with API changes.
February 13, 2012, 16:51:35
0.4.0 Initial Announcement on mloss.org.
January 16, 2012, 22:12:23