libraries vs scripts
February 18, 2011
Structuring software projects is one of the major challenges in computer science. For machine learning research, software should be easy to use yet flexible. One dimension I've found practically useful is the library versus script axis: computations are hidden away behind interfaces, and those interfaces are what separate the library from the script.
Any piece of software is essentially a sequence of commands, executed in order, to produce the desired machine learning result. However, human beings are particularly bad at dealing with large unstructured sequences (think spaghetti code), so it is often useful to abstract the details away behind an interface. I am not going to get into the debate about the "right way" to perform abstraction; I will simply use an object-oriented classifier as an example. This gives us the following toy example (a short Python sketch follows the outline):
Library (abstract base class: Classifier)
- kNN
- NN
- SVM
- RF
Interface (i.e. each class should implement the following)
- train
- predict
Scripts
- compute k-fold cross validation
- collect and summarise results
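To make the outline concrete, here is a minimal Python sketch of what such a structure might look like. The class and method names mirror the outline above; the kNN implementation is a deliberately naive stub, not a recommendation.

```python
from abc import ABC, abstractmethod

import numpy as np


class Classifier(ABC):
    """Library side: the abstract base class every classifier implements."""

    @abstractmethod
    def train(self, X, y):
        """Fit the model on training inputs X (n_samples x n_features) and labels y."""

    @abstractmethod
    def predict(self, X):
        """Return predicted labels for the rows of X."""


class KNN(Classifier):
    """A deliberately naive k-nearest-neighbour classifier, just to fill in the interface."""

    def __init__(self, k=3):
        self.k = k

    def train(self, X, y):
        # Lazy learner: simply memorise the training set.
        self.X_train = np.asarray(X)
        self.y_train = np.asarray(y)

    def predict(self, X):
        predictions = []
        for x in np.asarray(X):
            # Euclidean distance to every training point, then majority vote
            # among the k nearest labels.
            distances = np.linalg.norm(self.X_train - x, axis=1)
            nearest = self.y_train[np.argsort(distances)[: self.k]]
            labels, counts = np.unique(nearest, return_counts=True)
            predictions.append(labels[np.argmax(counts)])
        return np.array(predictions)
```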
One question that is already apparent from this simple example is whether cross validation should be part of the library or of the script. I consider the library the reusable part of the code, and the script the customization part. In essence, the script is the code that runs my library and produces the results that I can cut and paste into papers; this includes code for plotting, generating LaTeX tables, and so on. As my code evolves and it turns out that I use something across different papers, it migrates from the script side to the library side. So my working definition of what goes into the script and what goes into the library comes down to whether it is reused.
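For concreteness, a script in this scheme might look roughly like the following. It is a sketch that builds on the Classifier/KNN code above; the toy data and the LaTeX table row are placeholders for whatever the actual experiments and result tables are.

```python
import numpy as np


def k_fold_accuracy(clf, X, y, k=5, seed=0):
    """Script side: k-fold cross validation written against the train/predict interface."""
    rng = np.random.RandomState(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    accuracies = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        clf.train(X[train_idx], y[train_idx])
        accuracies.append(np.mean(clf.predict(X[test_idx]) == y[test_idx]))
    return np.mean(accuracies), np.std(accuracies)


if __name__ == "__main__":
    # Toy data stands in for whatever the actual experiments load.
    X = np.random.randn(200, 5)
    y = (X[:, 0] + X[:, 1] > 0).astype(int)

    # KNN comes from the library sketch above.
    mean_acc, std_acc = k_fold_accuracy(KNN(k=3), X, y, k=5)

    # "Collect and summarise results": e.g. emit a row for a LaTeX table.
    print(r"kNN & %.3f $\pm$ %.3f \\" % (mean_acc, std_acc))
```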
One advantage of structuring my code this way is that the scripts serve, in a sense, as "use cases" for the library. They provide examples of what the library interface means and how it should be used. This natural side effect of reproducible computational results also provides a (weak) test case for future changes to the library.
Interestingly, even though I use cross validation all the time to tune hyperparameters, it has resisted all my attempts to make it part of the library. I have many different versions of cross validation scattered all over my code base. Quite irritating really, but I haven't been able to find an abstraction that works for all the different types of parameters that I tune, such as which features to choose, normalization, regularization (of course), etc. Anybody have a good suggestion?
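To illustrate what such an abstraction attempt tends to look like, here is one sketch that keeps suggesting itself: cross-validate over a factory that builds the model from a configuration dictionary (all names here are hypothetical, and the code reuses k_fold_accuracy from the script sketch above). Part of the trouble is that choices like feature selection and normalization act on the data rather than on the model, so everything has to be squeezed through a config dict that the factory alone must interpret.

```python
import itertools

import numpy as np


def cross_validate_config(build_model, param_grid, X, y, k=5):
    """Grid-search over a configuration dictionary by cross-validating each candidate.
    build_model turns a config dict into something with train/predict;
    reuses k_fold_accuracy from the script sketch above."""
    best_config, best_score = None, -np.inf
    keys = sorted(param_grid)
    for values in itertools.product(*(param_grid[key] for key in keys)):
        config = dict(zip(keys, values))
        score, _ = k_fold_accuracy(build_model(config), X, y, k=k)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


# Hypothetical usage: which features to use, whether to normalise, and the
# regularisation strength all get squeezed into one config dict that the
# factory has to interpret, which is exactly where it starts to feel awkward.
# grid = {"features": [(0, 1), (0, 1, 2)], "normalise": [True, False], "C": [0.1, 1.0, 10.0]}
# best, score = cross_validate_config(lambda cfg: SVM(C=cfg["C"]), grid, X, y)
```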