
 Description:
ELKI: "Environment for Developing KDD-Applications Supported by Index-Structures" is a development framework for data mining algorithms written in Java. It includes a large variety of popular data mining algorithms, distance functions and index structures.
Its focus is particularly on clustering and outlier detection methods, in contrast to many other data mining toolkits that focus on classification. Additionally, it includes support for index structures, such as the R*-tree and M-tree, to improve algorithm performance.
The modular architecture is meant to allow adding custom components such as distance functions or algorithms, while being able to reuse the other parts for evaluation.
This package also includes the source code, since this software is meant for the rapid development of such algorithms, not so much for end users.
 Changes to previous version:
Additions and Improvements from ELKI 0.6.0:
Uncertain data types, and clustering algorithms for uncertain data.
Major refactoring of distances: removal of Distance values and of support for non-double-valued distance functions. While this reduces the generality of ELKI, we could remove about 2.5% of the codebase by no longer needing optimized code paths for double-distance. Generics for distances were present in almost every distance-based algorithm, and we were also happy to reduce the use of generics this way. Support for non-double-valued distances can trivially be added again, for example by adding the specialization one level higher: at the query level instead of the distance level.
In this process, we also removed the Generics from NumberVector. The object-based get was deprecated for a good reason long ago, and accessors such as doubleValue are more efficient (even for non-DoubleVectors).
Dropped some long-deprecated classes
Clustering algorithms:
K-means
 speedups for some initialization heuristics
 K-means++ initialization no longer squares distances (again)
 farthest-point heuristic now uses minimum instead of sum (renamed)
 additional evaluation criteria
 Elkan's and Hamerly's faster k-means variants
CLARA clustering
X-means
Hierarchical clustering
 Renamed naive algorithm to AGNES
 Anderberg's algorithm (faster than AGNES, slower than SLINK)
 CLINK for complete linkage clustering in O(n²) time, O(n) memory
 Simple extraction from HDBSCAN
 "Optimal" extraction from HDBSCAN
 HDBSCAN, in two variants
LSDBC clustering
EM clustering was refactored and moved into its own package. The new version is much more extensible.
Parallel computation framework, and some parallelized algorithms
 Parallel k-means
 Parallel LOF and variants
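The farthest-point seeding change noted under K-means above (minimum instead of sum) can be illustrated with a small stand-alone sketch. This is not ELKI code: the class FarthestPointSketch, its method names, and the choice to pass the first seed explicitly are assumptions made for illustration only. Each further seed is the point whose minimum distance to the already chosen seeds is largest.

```java
import java.util.ArrayList;
import java.util.List;

/** Stand-alone sketch (not ELKI code): farthest-point seeding using the minimum distance. */
class FarthestPointSketch {
  /** Euclidean distance between two points of equal dimensionality. */
  static double distance(double[] a, double[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return Math.sqrt(sum);
  }

  /** Choose k seed indexes; 'first' is the index of the initial seed (an assumption of this sketch). */
  static List<Integer> chooseSeeds(double[][] data, int k, int first) {
    List<Integer> seeds = new ArrayList<>();
    seeds.add(first);
    // minDist[i] holds the distance from point i to its nearest chosen seed.
    double[] minDist = new double[data.length];
    for (int i = 0; i < data.length; i++) {
      minDist[i] = distance(data[i], data[first]);
    }
    while (seeds.size() < k) {
      // Pick the point that is farthest from its nearest seed.
      int best = 0;
      for (int i = 1; i < data.length; i++) {
        if (minDist[i] > minDist[best]) {
          best = i;
        }
      }
      seeds.add(best);
      // Chosen seeds get minDist 0, so they cannot be picked again while other points remain.
      for (int i = 0; i < data.length; i++) {
        minDist[i] = Math.min(minDist[i], distance(data[i], data[best]));
      }
    }
    return seeds;
  }

  public static void main(String[] args) {
    double[][] data = { { 0 }, { 1 }, { 5 }, { 10 } };
    System.out.println(chooseSeeds(data, 3, 0)); // prints [0, 3, 2]
  }
}
```

In this toy example the sum of distances to the seeds 0 and 10 is 10 for every point, including the two seeds themselves, so a sum-based criterion cannot tell them apart. The minimum is 0 for chosen seeds and 5 for the point in the middle.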
Input:
 LibSVM format parser
Classification:
 kNN classification (with index acceleration)
Evaluation: Internal cluster evaluation:
 Silhouette index
 Simplified Silhouette index (faster)
 Davies-Bouldin index
 PBM index
 Variance-Ratio-Criteria
 Sum of squared errors
 C-Index
 Concordant pair indexes (Gamma, Tau)
 Different noise handling strategies for internal indexes
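As a rough illustration of why the simplified Silhouette listed above is faster: it is commonly defined using the distance of each point to the cluster centers rather than the average distance to all members of each cluster, so it needs n·k instead of n² distance computations. The following stand-alone sketch is not ELKI code. The class and method names are hypothetical, and it assumes at least two clusters and no noise handling.

```java
/** Stand-alone sketch (not ELKI code): mean simplified Silhouette from cluster centers. */
class SimplifiedSilhouetteSketch {
  /** Euclidean distance between two points of equal dimensionality. */
  static double distance(double[] a, double[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return Math.sqrt(sum);
  }

  /**
   * Mean simplified Silhouette. assign[i] is the cluster index of data[i];
   * centroids must contain at least two cluster centers.
   */
  static double simplifiedSilhouette(double[][] data, int[] assign, double[][] centroids) {
    double total = 0;
    for (int i = 0; i < data.length; i++) {
      // a: distance to the own cluster center.
      double a = distance(data[i], centroids[assign[i]]);
      // b: distance to the nearest other cluster center.
      double b = Double.POSITIVE_INFINITY;
      for (int c = 0; c < centroids.length; c++) {
        if (c != assign[i]) {
          b = Math.min(b, distance(data[i], centroids[c]));
        }
      }
      double max = Math.max(a, b);
      total += max > 0 ? (b - a) / max : 0;
    }
    return total / data.length;
  }

  public static void main(String[] args) {
    double[][] data = { { 0 }, { 2 }, { 10 }, { 12 } };
    int[] assign = { 0, 0, 1, 1 };
    double[][] centroids = { { 1 }, { 11 } };
    System.out.println(simplifiedSilhouette(data, assign, centroids));
  }
}
```

For the toy data in main, the per-point values are 10/11, 8/9, 8/9 and 10/11, giving a mean of 89/99.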
Statistical dependence measures:
 Distance correlation dCor.
 Hoeffding's D.
 Some divergence / mutual information measures.
Distance functions:
 Big refactoring.
 Time series distances refactored, allow variable length series now.
 Hellinger distance and kernel function.
Preprocessing:
 Faster MDS implementation using power iterations.
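For readers unfamiliar with the technique: power iteration approximates the dominant eigenvector of a matrix by repeated multiplication and normalization, which avoids a full eigendecomposition when only a few leading eigenvectors are needed, as in MDS. The sketch below shows the basic iteration only and is not the ELKI implementation. The class name, the fixed iteration count and the starting vector are assumptions of this example.

```java
import java.util.Arrays;

/** Stand-alone sketch (not ELKI code): power iteration for the dominant eigenpair. */
class PowerIterationSketch {
  /** Matrix-vector product m * v for a square matrix m. */
  static double[] multiply(double[][] m, double[] v) {
    double[] w = new double[v.length];
    for (int i = 0; i < m.length; i++) {
      for (int j = 0; j < v.length; j++) {
        w[i] += m[i][j] * v[j];
      }
    }
    return w;
  }

  /**
   * Approximates the dominant eigenvalue of the symmetric matrix m.
   * The unit-length eigenvector estimate is written into v.
   * Assumes the start vector (first unit vector) is not orthogonal
   * to the dominant eigenvector.
   */
  static double dominantEigen(double[][] m, double[] v, int iterations) {
    Arrays.fill(v, 0.0);
    v[0] = 1.0;
    for (int it = 0; it < iterations; it++) {
      double[] w = multiply(m, v);
      double norm = 0;
      for (double x : w) {
        norm += x * x;
      }
      norm = Math.sqrt(norm);
      if (norm == 0) {
        break; // v lies in the null space; nothing more to do
      }
      for (int i = 0; i < v.length; i++) {
        v[i] = w[i] / norm;
      }
    }
    // Rayleigh quotient v^T m v (v has unit length) as eigenvalue estimate.
    double[] mv = multiply(m, v);
    double lambda = 0;
    for (int i = 0; i < v.length; i++) {
      lambda += v[i] * mv[i];
    }
    return lambda;
  }

  public static void main(String[] args) {
    double[][] m = { { 2, 1 }, { 1, 2 } }; // eigenvalues 3 and 1
    double[] v = new double[2];
    System.out.println(dominantEigen(m, v, 100) + " " + Arrays.toString(v));
  }
}
```

Further eigenvectors can be obtained by removing the found component from the matrix (deflation) and repeating the iteration. Whether ELKI does it this way is not stated in these notes.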
Indexing improvements:
 Precomputed distance matrix "index".
 iDistance index (static only).
 Inverted-list index for sparse data and cosine/arccosine distance.
 Cover tree index (static only).
Frequent Itemset Mining:
 Improved APRIORI implementation.
 FP-Growth added.
 Eclat (basic version only) added.
Uncertain clustering:
 Discrete and continuous data models
 FDBSCAN clustering
 UKMeans clustering
 CKMeans clustering
 Representative Uncertain Clustering (meta-algorithm)
 Center-of-mass meta clustering (allows using other clustering algorithms on uncertain objects) (KDD'14)
Outlier detection changes / smaller improvements:
 KDEOS outlier detection (SDM14)
 k-means based outlier detection (distance to centroid) and a Silhouette coefficient based approach (which does not work too well on the toy data sets: the lowest silhouette values are usually where two clusters touch).
 bug fix in kNN weight, when distances are tied and kNN yields more than k results.
 kNN and kNN weight outlier have their k parameter changed: the old 2NN outlier is now the 1NN outlier, as commonly understood in classification literature (1 nearest neighbor "other than the query object"; whereas in database literature the 1NN is usually the query object itself). You can easily get the old result back by decreasing k by one.
 LOCI implementation is now only O(n^3 log n) instead of O(n^4).
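The changed meaning of k for the kNN outlier methods can be made concrete with a small stand-alone sketch. It is not ELKI code: the class KnnOutlierSketch and its method are hypothetical, and it uses one-dimensional data and a full sort for brevity. The score of a point is the distance to its k-th nearest neighbor, where the query object itself is not counted.

```java
import java.util.Arrays;

/** Stand-alone sketch (not ELKI code): kNN distance with the query object excluded. */
class KnnOutlierSketch {
  /** Distance from data[query] to its k-th nearest neighbor, not counting the query itself. */
  static double knnDistance(double[] data, int query, int k) {
    double[] dists = new double[data.length - 1];
    int n = 0;
    for (int i = 0; i < data.length; i++) {
      if (i != query) { // skip the query object
        dists[n++] = Math.abs(data[i] - data[query]);
      }
    }
    Arrays.sort(dists);
    return dists[k - 1];
  }

  public static void main(String[] args) {
    double[] data = { 0, 1, 2, 10 };
    System.out.println(knnDistance(data, 3, 1)); // prints 8.0
  }
}
```

With k = 1 the isolated point 10 gets the score 8.0, the distance to its nearest other point. Under the older convention, where the query object counts as its own first neighbor, the same score would have required k = 2.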
Various:
MiniGUI has two "secret" new options, minigui.last and minigui.autorun, to load the last saved configuration and run it, for convenience.
Logging API has been extended, to make logging more convenient in a number of places (saving some lines for progress logging and timing).
 Supported Operating Systems: Platform Independent
 Data Formats: Arff, Other, Csv, Parser Extension Api
 Tags: Clustering, Visualization, Algorithms, Evaluation, Anomaly Detection, Outlier Detection, Index Structures