
 Description:
This software is written in C++ and contains routines for statistical classification, probability estimation and interpolation/nonlinear regression. Two variable-bandwidth kernel methods are adopted: k-nearest neighbours (KNN), and a balloon estimator based on Gaussian kernels, hence Adaptive Gaussian Filtering (AGF). A library of easy-to-use, single-call functions (you call a single function once for each estimate, with no initialization required) is included, as well as command-line executables.
The statistical classification routines are particularly powerful, allowing you to generate a pre-trained model by searching for the class borders. The borders can then be used to make rapid classifications that nonetheless return estimates of the conditional probabilities.
Clustering routines are a recent addition.
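As a rough illustration of the KNN side of the description above, the following is a minimal sketch of a single-call routine that returns conditional probability estimates. It is not the libagf API: the names sq_dist and knn_cond_prob, the argument layout, and the use of simple neighbour counting are all assumptions made for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper (not a libagf function): squared Euclidean distance.
template <typename real>
real sq_dist(const std::vector<real>& a, const std::vector<real>& b) {
    real d = 0;
    for (std::size_t i = 0; i < a.size(); ++i) d += (a[i] - b[i]) * (a[i] - b[i]);
    return d;
}

// Estimates P(c | x) for each class c as the fraction of the k nearest
// training samples that belong to c. Class labels must lie in [0, ncls).
template <typename real, typename cls_t>
std::vector<real> knn_cond_prob(const std::vector<std::vector<real>>& train,
                                const std::vector<cls_t>& cls,
                                const std::vector<real>& x,
                                std::size_t k, std::size_t ncls) {
    assert(train.size() == cls.size());
    std::vector<std::pair<real, cls_t>> d(train.size());
    for (std::size_t i = 0; i < train.size(); ++i)
        d[i] = std::make_pair(sq_dist(train[i], x), cls[i]);
    k = std::min(k, d.size());
    // Only the k smallest distances need to be put in order.
    std::partial_sort(d.begin(), d.begin() + k, d.end());
    std::vector<real> p(ncls, 0);
    for (std::size_t i = 0; i < k; ++i)
        p[static_cast<std::size_t>(d[i].second)] += real(1) / k;
    return p;
}
```

Calling knn_cond_prob<float, int>(train, cls, x, k, ncls) once per test point returns one probability per class, and the class with the largest entry is the classification.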
 Changes to previous version:
New in version 0.92:
In the direct classification routines (classify_a, classify_knn), there is now an option (j) to print out joint probabilities instead of conditional probabilities. Of course this can be done by calculating the total probability and multiplying by the conditional probability, but this means redundant calculation.
In class_borders, added the option (r) to solve for a class border other than at R=0. This is useful if your classes are of significantly different size, especially when the training data does not reflect this.
There is now a simple clustering analysis program (cluster_knn) based on a threshold density. It works by first finding a point at which the density is greater than this threshold. Using the k-nearest neighbours to this point, it recursively finds all other points above this threshold and assigns them the same class number.
The option to use a metric other than Cartesian now exists. Since many of the calculations are specifically based on a Cartesian space, especially the PDF estimation, this should be applied with some caution.
Option for different names for files containing normalization data. It's a pretty minor point, so it's only been implemented in two or three programs, chiefly the class_borders and classify_b modules. I'm too lazy to do them all...
Added an n-fold cross-validation program that works with all the classification algorithms.
Added a small utility that just normalizes the data and that's it. Also cleaned up and properly renamed a utility (vecfile2lvq) to convert the binary files to Kohonen's LVQ format.
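The threshold-density clustering described in the list above can be sketched as follows. This is only an illustration under stated assumptions, not the cluster_knn source: the function name threshold_cluster is hypothetical, the densities and neighbour lists are assumed to be computed beforehand, and an explicit stack replaces the recursion.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// density[i]: estimated density at sample i (assumed precomputed).
// nbr[i]: indices of the k nearest neighbours of sample i (assumed precomputed).
// Returns one label per sample: -1 if the sample is not above the threshold,
// otherwise a cluster number starting from 0.
std::vector<int> threshold_cluster(const std::vector<double>& density,
                                   const std::vector<std::vector<std::size_t>>& nbr,
                                   double threshold) {
    assert(density.size() == nbr.size());
    std::vector<int> label(density.size(), -1);
    int ncluster = 0;
    for (std::size_t seed = 0; seed < density.size(); ++seed) {
        if (density[seed] <= threshold || label[seed] != -1) continue;
        // Grow a new cluster outward from this seed through its neighbours.
        std::vector<std::size_t> stack{seed};
        label[seed] = ncluster;
        while (!stack.empty()) {
            std::size_t i = stack.back();
            stack.pop_back();
            for (std::size_t j : nbr[i]) {
                if (density[j] > threshold && label[j] == -1) {
                    label[j] = ncluster;
                    stack.push_back(j);
                }
            }
        }
        ++ncluster;
    }
    return label;
}
```

Samples that stay at -1 lie in regions below the threshold density and belong to no cluster.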
New in Version 0.9.3:
The libpetey library is no longer part of the libagf distribution.
The class borders codes can no longer generate duplicate samples. There are two versions: one for large training datasets, and one for small. If all combinations of pairs of training samples have been used up, the codes will generate no more training samples.
New in Version 0.9.4:
Most importantly, everything, except the IO routines, has been templated. This means you can do your work in single or double precision and you can represent your classes as bytes, 8-bit integers, 16-bit integers, 32-bit integers, etc., whatever size you want.
With the exception of those used in external routines, variable types in the main routines are now controlled with global typedefs, with each class of variable having a different type. This means you can tightly control the typing for optimal use of space or CPU cycles. Classes have a default type of 32-bit integers while floating point operations are done in single precision by default.
Different metrics are now only supported in the routines where they make sense: KNN classification and KNN interpolation. The functions now require a pointer to the desired metric.
The n-fold routine now supports interpolation. Note that this is still not well tested (if at all).
File conversion utilities as well as the test class routines have now been integrated into the main distribution simply by linking the two makefiles more closely, thus allowing easier testing and more user-friendly files.
A routine that performs AGF PDF estimation with an optimal error rate is currently being tested but is not ready yet. We hope to have it ready in a new release very shortly.
Also in the next release: multi-class classification using the class-borders method. Stay tuned!
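To make the typedef and metric-pointer items above more concrete, here is a minimal sketch. The typedef names (real_type, cls_type, metric_type) and the function nearest_class are invented for this example and are not taken from the libagf headers; only the defaults (single precision, 32-bit class labels) and the idea of passing a pointer to the metric come from the notes above.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative typedef names; the real names in libagf may differ.
typedef float real_type;        // floating-point type, single precision by default
typedef std::int32_t cls_type;  // class labels, 32-bit integers by default

// A metric maps two vectors of length D to a distance.
typedef real_type (*metric_type)(const real_type*, const real_type*, std::size_t);

real_type cartesian(const real_type* a, const real_type* b, std::size_t D) {
    real_type s = 0;
    for (std::size_t i = 0; i < D; ++i) s += (a[i] - b[i]) * (a[i] - b[i]);
    return std::sqrt(s);
}

real_type manhattan(const real_type* a, const real_type* b, std::size_t D) {
    real_type s = 0;
    for (std::size_t i = 0; i < D; ++i) s += std::fabs(a[i] - b[i]);
    return s;
}

// Class of the training sample nearest to x under the supplied metric.
cls_type nearest_class(const std::vector<std::vector<real_type>>& train,
                       const std::vector<cls_type>& cls,
                       const std::vector<real_type>& x, metric_type metric) {
    assert(!train.empty() && train.size() == cls.size());
    std::size_t best = 0;
    real_type dbest = metric(train[0].data(), x.data(), x.size());
    for (std::size_t i = 1; i < train.size(); ++i) {
        real_type d = metric(train[i].data(), x.data(), x.size());
        if (d < dbest) { dbest = d; best = i; }
    }
    return cls[best];
}
```

Changing a typedef in one place switches the precision or label width everywhere, and passing &manhattan instead of &cartesian changes the neighbour search without touching the routine itself.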
 Supported Operating Systems: Agnostic
 Data Formats: Ascii, Binary
 Tags: Clustering, Nonparametric Density Estimation, Supervised Learning, Interpolation, Inverse Methods, Kernel Estimation, Nonlinear Regression, Probability Estimation, Statistical Classification
Other available revisions

Version Changelog Date
0.9.8 New in Version 0.9.8:
bug fixes: SVM file conversion works properly and is more general
non-hierarchical multiborders has 3 options for solving for the conditional probabilities: matrix inversion, voting, and matrix inversion overridden by voting, with renormalization
multiborders now works with external binary classifiers
random numbers resolve a tie when selecting classes based on probabilities
pair of routines, sort_discrete_vectors and search_discrete_vectors, for classification based on n-d binning (still experimental)
command options have been changed with many new additions, see QUICKSTART file or run the relevant commands for details
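The random tie-breaking mentioned in the list above might look something like the following sketch. The function name select_class and the use of the standard <random> facilities are assumptions for illustration; the source does not say how libagf draws its random numbers.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Returns the index of the largest probability in p. Exact ties are broken
// uniformly at random using the supplied generator. p must not be empty.
template <typename real, typename rng_t>
std::size_t select_class(const std::vector<real>& p, rng_t& rng) {
    assert(!p.empty());
    std::vector<std::size_t> best;  // indices sharing the current maximum
    for (std::size_t i = 0; i < p.size(); ++i) {
        if (best.empty() || p[i] > p[best[0]]) {
            best.clear();
            best.push_back(i);
        } else if (p[i] == p[best[0]]) {
            best.push_back(i);
        }
    }
    std::uniform_int_distribution<std::size_t> pick(0, best.size() - 1);
    return best[pick(rng)];
}
```

Without the random draw, a tie would always resolve to the lowest class index, which biases the results towards that class.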
December 6, 2014, 02:35:39
0.9.7 New in Version 0.9.7:
 multi-class classification generalizes the class-borders algorithm using a recursive control language
 hierarchical clustering
 improved preprocessing
April 15, 2014, 04:55:41
0.9.6 New in Version 0.9.6:
 cross-validation of PDF estimates
 computation of relative operating characteristic (ROC) curves
December 12, 2012, 03:37:48
0.9.5 New in Version 0.9.5:
Sadly, neither the multiclass classifier using the "borders" method, nor the optimal AGF routine have been perfected yet. However, there are quite a few other good improvements to sweeten the mix...
The routine for finding the k-nearest-neighbours has been changed from one based on a binary tree to one based on a quicksort algorithm. Speed improvements are expected to be on the order of 25%. To change back to the old version, use the macro, KLEAST_FUNC, in the agf_defs.h include file.
The routine for calculating the weights for the AGF algorithm now matches the filter variance to the W parameter using the supernewton root-finding algorithm instead of by squaring the initial weights. This means that there are now two bounds for the filter variance. They are set by the v and V options for the lower and upper bounds respectively. Since it is trivial to push the bounds outward if they do not bracket the root, and since these changes are "sticky", it does not matter if the high bound is too low or the low bound too high. Rather, the user should try to avoid the opposite extreme, as this will mean a larger number of iterations to reach the root. Default bounds are [sigma^2/n^(2/D), sigma^2], where sigma^2 is the total variance of the data.
The new weight-calculating routine is more accurate and should be more robust as well, although at the cost of a slight speed penalty. As with the kleast subroutine, however, the old version can be reinstated by changing the AGF_CALC_W_FUNC macro. The initial filter variance, since it is an upper bound, is now set with the V option instead of the v option.
For maximum control of the weight-calculating routine, several new options have been added. To change the maximum number of iterations in the supernewton root-finding algorithm, use the I option. This changes it for both calculation of weights and for searching for the class borders. To change it for one or the other, use i for the weight calculation routine and h for the class borders routine. The default number of iterations for both is 100, which may not be sufficient for some problems.
To change the tolerance of W, or the total of the weights, use the l option. Default is 0.005 which should be more than sufficient. Since the accuracy of W is not that critical, the tolerance can be degraded, probably as high as 1, for a slight speed savings.
The parameter W is now set with the W option (uppercase double-u) instead of the w option (lowercase double-u).
The optimal AGF may not work yet, but it's a lot more user friendly! Check the documentation...
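The idea of matching the filter variance to W, as described in the notes above, can be sketched as below. Several things here are assumptions rather than the library's code: the function names total_weight and match_variance are hypothetical, the weights are taken to be exp(-d^2/(2 var)), and plain bisection is used as a simple stand-in for the supernewton routine. The default tolerance of 0.005 and the limit of 100 iterations follow the values quoted above.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Sum of Gaussian weights for squared distances d2 and filter variance var.
double total_weight(const std::vector<double>& d2, double var) {
    double sum = 0;
    for (double d : d2) sum += std::exp(-d / (2 * var));
    return sum;
}

// Finds var such that total_weight(d2, var) is within tol of W.
// Assumes W lies strictly between the number of zero distances and
// d2.size(), so that a root exists.
// Bisection is used here in place of the supernewton algorithm.
double match_variance(const std::vector<double>& d2, double W,
                      double var_lo, double var_hi,
                      double tol = 0.005, int maxiter = 100) {
    assert(var_lo > 0 && var_lo < var_hi);
    // Push the bounds outward until they bracket the root; the total
    // weight increases monotonically with the variance.
    for (int i = 0; i < maxiter && total_weight(d2, var_lo) > W; ++i) var_lo /= 2;
    for (int i = 0; i < maxiter && total_weight(d2, var_hi) < W; ++i) var_hi *= 2;
    double var = 0.5 * (var_lo + var_hi);
    for (int i = 0; i < maxiter; ++i) {
        var = 0.5 * (var_lo + var_hi);
        double err = total_weight(d2, var) - W;
        if (std::fabs(err) < tol) break;
        if (err > 0) var_hi = var; else var_lo = var;
    }
    return var;
}
```

Starting from bounds that are both too low, for example match_variance(d2, 3.0, 1e-3, 1e-2), still converges because the upper bound is doubled until it brackets the root; the cost is a few extra evaluations of the weights.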
September 14, 2012, 22:20:41
0.9.4 New in versions 0.92, 0.9.3 and 0.9.4: as listed under "Changes to previous version" above.
November 27, 2011, 06:25:56
0.92 Initial Announcement on mloss.org.
April 30, 2010, 05:43:32
Comments

 Peter Mills (on April 15, 2014, 04:55:05)
Multiborders classification is now ready. I am very pleased (and pleasantly surprised) with how well it works.
I had hoped to have multi-class border classification ready by now, but the simple generalization I had envisioned to implement it won't work in all cases. The idea was to use matrix inversion to solve for the conditional probabilities, but quite obviously (in retrospect) you can solve for the class without being able to determine all the conditional probabilities. Likely we need two cases: one where all the conditional probabilities can be found, and one where only that of the retrieved class can be found, and these two cases need to interoperate. A recursive or hierarchical model would seem to be the best solution here.
I realize that there is literature relating to the problem of creating multi-class classifications from two-class ones; however, I do not currently have access to commercial journals as I am not affiliated with an academic or research institution. It is also an enjoyable challenge to try and figure these things out for yourself, from scratch, so to speak.
Likewise I had hoped to have the optimal-bandwidth Gaussian PDF estimation ready. I had made some progress on it, but the test cases were not giving consistent results and I have failed to work on it in the intervening months.