libraries vs scripts
Posted by Cheng Soon Ong on February 18, 2011
Structuring software projects is one of the major challenges in computer science. For machine learning research, software should be easy to use yet flexible. One dimension I've found practically useful is the library-versus-script distinction: computations are hidden away behind interfaces, which separate the library from the script.
Any software is essentially a sequence of commands executed in order to produce the desired machine learning result. However, human beings are particularly bad at dealing with large unstructured sequences (think spaghetti code), so it is often useful to abstract away the details behind an interface. I am not going to get into the debate about the "right way" to perform abstraction; I will just use an object-oriented classifier as an example. This gives us the following toy example, sketched in code below the lists:
Library (abstract base class: Classifier)
- kNN
- NN
- SVM
- RF
Interface (i.e. each class should implement the following)
- train
- predict
Scripts
- compute k-fold cross validation
- collect and summarise results
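As a rough sketch of this toy example (in Python, with illustrative names and a deliberately naive kNN; this is not the actual code of any particular library):

    # Library side: reusable code hidden behind a small interface.
    from abc import ABC, abstractmethod
    from collections import Counter

    class Classifier(ABC):
        """Every classifier implements the same two-method interface."""

        @abstractmethod
        def train(self, X, y): ...

        @abstractmethod
        def predict(self, X): ...

    class KNN(Classifier):
        """Deliberately naive k-nearest-neighbour classifier (majority vote)."""

        def __init__(self, k=3):
            self.k = k

        def train(self, X, y):
            self.X, self.y = list(X), list(y)   # a lazy learner just memorises the data

        def predict(self, X):
            predictions = []
            for x in X:
                by_distance = sorted(range(len(self.X)),
                                     key=lambda i: sum((a - b) ** 2 for a, b in zip(x, self.X[i])))
                votes = [self.y[i] for i in by_distance[:self.k]]
                predictions.append(Counter(votes).most_common(1)[0][0])
            return predictions

    # Script side: the customisation that produces the numbers for a paper.
    def kfold_accuracy(clf, X, y, k=5):
        """k-fold cross-validation accuracy for anything obeying the interface."""
        accuracies = []
        for fold in range(k):
            test = set(range(fold, len(X), k))
            Xtr = [x for i, x in enumerate(X) if i not in test]
            ytr = [t for i, t in enumerate(y) if i not in test]
            Xte = [x for i, x in enumerate(X) if i in test]
            yte = [t for i, t in enumerate(y) if i in test]
            clf.train(Xtr, ytr)
            predictions = clf.predict(Xte)
            accuracies.append(sum(p == t for p, t in zip(predictions, yte)) / len(yte))
        return sum(accuracies) / len(accuracies)

A script would then do little more than print(kfold_accuracy(KNN(k=3), X, y)) and format the result for a table.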
One question that is already apparent from this simple example is whether cross validation should be part of the library or the script. I consider the library the reusable part of the code, and the script the customization part. In essence, the script is the code that runs my library and produces the results that I can cut and paste into papers. This involves code for plotting, generating LaTeX tables, etc. So, as my code evolves and it turns out that I use something across different papers, it migrates from the script side to the library side. My working definition of what goes into the script and what goes into the library is therefore whether the code is reused.
One advantage of structuring my code this way is that the scripts serve as "use cases" for the library. They provide examples of what the library interface means and how it should be used. This natural side effect of reproducible computational results also provides a (weak) test case for future changes to the library.
Interestingly, even though I use cross validation all the time to tune hyperparameters, it has resisted all my attempts to make it part of the library. I have many different versions of cross validation all over my code base. Quite irritating really, but I haven't been able to find an abstraction that works for all the different types of parameters that I tune, such as which features to choose, normalization, regularization (of course), etc. Anybody have a good suggestion?
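For concreteness, one possible shape such an abstraction could take is to describe each candidate setting as a plain dictionary and let a small factory turn it into a configured classifier. The sketch below reuses the toy KNN and kfold_accuracy from above, and it only covers constructor parameters, not feature selection or normalization, so it is not a claim to solve the problem:

    # Sketch only: a candidate setting is a dict, and a factory maps it to a classifier.
    REGISTRY = {"knn": KNN}            # would grow to include SVM, RF, ...

    def make_classifier(setting):
        name = setting["name"]
        params = {k: v for k, v in setting.items() if k != "name"}
        return REGISTRY[name](**params)

    def pick_best(settings, X, y):
        """Return the setting with the highest k-fold cross-validation accuracy."""
        scored = [(kfold_accuracy(make_classifier(s), X, y), s) for s in settings]
        return max(scored, key=lambda pair: pair[0])[1]

    # e.g. pick_best([{"name": "knn", "k": 1}, {"name": "knn", "k": 5}], X, y)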
Comments
-
- Alois Schloegl
Within the NaN-toolbox, I use a function XVAL for doing the cross-validation. It uses an additional 'group' value assigned to each sample, which determines the kind of cross-validation that is applied. If the group values are all different (e.g. running from 1 to N), the result is a leave-one-out method; g = ceil([1:N]/K) gives a leave-K-out method, g = ceil([1:N]/N*2) gives split-half, and so on. In this way, it is possible to apply any L-fold cross-validation scheme.
Besides cross-validation, this group value is also useful for data where groups of samples cannot be considered independent; dependent samples get the same group value, and it is guaranteed that all samples from a single group end up either in the test set or in the training set, and are not mixed (which could cause overfitting).
I found this approach very useful; maybe it is also useful to you.
Alois
P.S.: so far, XVAL does not support permutations (as in K-times L-fold cross-validation), but if needed the concept could be extended to support them.
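(Editorial illustration of the group idea in Python; this sketches the concept only and is not code from the NaN-toolbox.)

    # Each sample carries a group value; every round of cross-validation holds out
    # exactly one group, so dependent samples never get split across train and test.
    import math

    def leave_k_out_groups(n, k):
        """g = ceil(i / k): consecutive blocks of k samples share a group value."""
        return [math.ceil(i / k) for i in range(1, n + 1)]

    def group_folds(groups):
        """Yield (train_indices, test_indices), one pair per distinct group value."""
        for g in sorted(set(groups)):
            test = [i for i, gi in enumerate(groups) if gi == g]
            train = [i for i, gi in enumerate(groups) if gi != g]
            yield train, test

    # k = 1 gives leave-one-out, larger k gives leave-k-out, and two equal-sized
    # groups give the split-half scheme.
-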
- Cheng Soon Ong (on April 11, 2011, 16:11:50)
Thanks for the pointer.
What would a generic API look like? Something like this?
cross_val(samples, group, predictor, success, fixed_param, opt_param)
where:
- samples are the examples and labels
- group is the variable you mentioned, to say which validation set an example belongs to
- predictor is some classifier, which will need some fixed API, e.g. predictor.train and predictor.predict
- fixed_param and opt_param are some sort of complicated structure that together determine how the parameters of predictor are set. The difference is that we optimize over the settings in opt_param, and keep fixed_param fixed.
What I have to keep rewriting again and again is a piece of code that converts from some set of parameters to an initialization of a classifier. To make things a bit more concrete, consider two classifiers: svm and knn.
opt_param for svm could be something like the regularization parameter and kernel function. fixed_param could be the stopping condition for training.
opt_param for knn could be the number of neighbours k, and fixed_param could be the definition of how the graph is built, e.g. symmetric neighbours.
Do you think such code is possible? Does XVAL already do this?
Regards, Cheng
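(Editorial illustration: one way the signature above could be fleshed out in Python. It treats predictor as a class to be constructed and success as a scoring function, both of which are guesses, reuses group_folds from the earlier sketch, and uses a brute-force grid over opt_param in place of any cleverer optimisation.)

    from itertools import product

    def cross_val(samples, group, predictor, success, fixed_param, opt_param):
        """Try every combination of values in opt_param (a dict of candidate lists),
        score each with group-wise cross-validation, and return the best setting."""
        X, y = samples
        best_score, best_setting = float("-inf"), None
        names, candidate_lists = zip(*opt_param.items())
        for values in product(*candidate_lists):            # brute-force grid search
            setting = dict(zip(names, values))
            scores = []
            for train, test in group_folds(group):
                clf = predictor(**fixed_param, **setting)    # predictor read as a class/factory
                clf.train([X[i] for i in train], [y[i] for i in train])
                predictions = clf.predict([X[i] for i in test])
                scores.append(success(predictions, [y[i] for i in test]))
            score = sum(scores) / len(scores)
            if score > best_score:
                best_score, best_setting = score, setting
        return best_setting, best_score

    # e.g. with the toy KNN:
    # accuracy = lambda p, t: sum(a == b for a, b in zip(p, t)) / len(t)
    # cross_val((X, y), leave_k_out_groups(len(y), 2), KNN, accuracy,
    #           fixed_param={}, opt_param={"k": [1, 3, 5]})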
-
- Alois Schloegl (on April 12, 2011, 23:49:19)
No, XVAL only does cross-validation; nested cross-validation (including a training-test-validation split) is not included. If such an approach is needed, I'd select the validation set by hand and a priori; XVAL can then be applied to the training-test sets for optimizing the hyperparameters.
Before adding support for a permutation of the validation set, one should think twice or three times about whether the possible advantages can really outweigh the costs (possible overfitting, spurious gains, additional complexity, increased computational effort, reduced training data because of the need for a validation set).
Such code would be possible, but it should provide a metric that shows whether the results from different hyperparameters are significantly different or not. Otherwise, it is not possible to tell whether the gain is spurious or not. Ideally, the method would also show for which range of hyperparameters the differences in the classification results are not statistically significant, and whether this range is clear-cut and well defined or broken up.
Cheers, Alois