Some thoughts on Machine Learning Toolboxes
Posted by Mikio Braun on May 24, 2008
One popular format for an open source project in machine learning seems to be the creation of a complete toolbox, providing most of what you need for your everyday machine learning work. Often, such projects are not consciously started but just evolve out of the environment one constructs for oneself. Which is good, as it ensures that included features actually work and are relevant.
Often, such toolboxes use one of the new scripting languages, for example, python together with a scientific toolbox like scipy. PyML and Monte Python are two examples which can be found at mloss.org. The alternatives would either be using something like matlab, which already contains an enormous amount of support for numerical computations, or a compiled language like Java, in which you can build almost everything.
Actually, I think that using a scripting language like python is a huge step in the right direction. All the raw computing power supplied by matlab aside, the programming language used in matlab is already a bit rusty. Well, I know that matlab provides some support for object oriented programming, but the one-file-per-function rule really breaks down when you start assembling objects. Pass-everything-by-value is also quite a headache. (As I was just visiting their website, it seems that they have cleaned up their OOP stuff a bit. But since I haven't checked too closely, and for the sake of the argument, let's just pretend they haven't.)
So, yes, scripting languages are great because we finally have much more powerful tools for modelling the computational processes which we work with. And an often overlooked fact (from the viewpoint of a toolbox designer) is that machine learning is not just about analyzing data, but also about developing new methods. Interestingly, we apply the same statistical methodology to evaluate methods which we also use to analyze raw data: for example, we assess the methods on resamples of the data when using cross-validation, and we apply statistical tests to see whether a method performs significantly better than the state-of-the-art.
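To make the "statistical tests on methods" part concrete, here is a minimal sketch of a paired sign-flip (permutation) test on per-fold error rates. The function name `sign_flip_p_value` and all the numbers are made up for illustration, and this is only one of several tests one could use here. Note also that cross-validation folds are not independent, so the resulting p-value should be read as a rough guide.

```python
from itertools import product


def sign_flip_p_value(diffs):
    """Exact two-sided sign-flip test on the sum (equivalently, the mean) of paired differences."""
    observed = abs(sum(diffs))
    count = 0
    total = 0
    # Under the null hypothesis each difference is equally likely to have either sign,
    # so we enumerate every possible assignment of signs.
    for signs in product((1, -1), repeat=len(diffs)):
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:  # small tolerance for floating point noise
            count += 1
        total += 1
    return count / total


# Made-up per-fold error rates of two methods evaluated on the same five folds.
errors_old = [0.30, 0.35, 0.25, 0.40, 0.30]
errors_new = [0.20, 0.15, 0.15, 0.10, 0.10]

diffs = [a - b for a, b in zip(errors_old, errors_new)]
p_value = sign_flip_p_value(diffs)
```

With only five folds the smallest p-value this test can produce is 2/32, which is a nice reminder of how little evidence a single cross-validation run actually carries.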
In other words, machine learning research actually closes the loop between data and analyzer in the sense that the methods which we use to analyze data become the object of study themselves (and therefore also the target of statistical analysis). What this means for the underlying programming language is that it must have the capacity to treat methods as objects themselves.
The programming language beneath matlab basically goes as far as allowing function handles (if we forget the OOP part which has been bolted onto the language), but machine learning methods have much more structure than a function which takes some arguments. For example, most methods have some additional parameters which have to be tuned to achieve good performance. Only if you can talk about a method and its parameters in a natural fashion can you start to write something like a truly generic cross-validation method, or a function which takes a bunch of methods and a data set and computes the table of numbers which allows us to compare the methods (and write papers).
All of this is simple in an object oriented language like python. We can implement learning methods as objects, not just as a collection of train/predict functions, give them member functions for querying all the interesting additional information, and then write routines which work with other methods. (Okay, I admit that this might also be possible in matlab. I have had ideas about how to parse the initial comment in a matlab script file to extract this kind of information, but let's just not go there...)
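A minimal sketch of what this could look like, using only the standard library. Everything here is hypothetical and not taken from any existing toolbox: the `NearestMean` classifier, its made-up `shrink` parameter, the `params()` query function and the `cross_validate` routine are just names I picked for illustration.

```python
import random


class NearestMean:
    """Toy 1-d classifier: assign a point to the class with the closest mean."""

    def __init__(self, shrink=0.0):
        # 'shrink' is an invented tunable parameter: it pulls each class
        # mean towards zero by the given fraction.
        self.shrink = shrink
        self.means = {}

    def params(self):
        """Let generic code ask which parameters this method has."""
        return {"shrink": self.shrink}

    def train(self, X, y):
        self.means = {}
        for label in set(y):
            points = [x for x, t in zip(X, y) if t == label]
            self.means[label] = (1.0 - self.shrink) * sum(points) / len(points)
        return self

    def predict(self, X):
        return [min(self.means, key=lambda c: abs(x - self.means[c])) for x in X]


def cross_validate(method, X, y, folds=5):
    """Generic k-fold cross-validation for any object with train() and predict()."""
    errors = []
    for k in range(folds):
        test = set(range(k, len(X), folds))  # every folds-th point, starting at k
        X_tr = [x for i, x in enumerate(X) if i not in test]
        y_tr = [t for i, t in enumerate(y) if i not in test]
        X_te = [X[i] for i in sorted(test)]
        y_te = [y[i] for i in sorted(test)]
        pred = method.train(X_tr, y_tr).predict(X_te)
        errors.append(sum(p != t for p, t in zip(pred, y_te)) / len(y_te))
    return errors


# Two well separated clusters, so the toy classifier has an easy job.
rng = random.Random(0)
X = [rng.uniform(-3, -2) for _ in range(20)] + [rng.uniform(2, 3) for _ in range(20)]
y = [0] * 20 + [1] * 20

method = NearestMean(shrink=0.1)
errors = cross_validate(method, X, y)
```

The point is that `cross_validate` knows nothing about `NearestMean`. It only relies on `train()` and `predict()`, and a parameter tuning routine could be written in the same generic way on top of `params()`.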
I personally think scripting languages are also better prepared for building this kind of abstract framework than statically typed languages like Java. The reason is that the flexibility of the type system (or a complete lack thereof) allows us to build frameworks which are quite flexible and work with all kinds of objects as long as they provide the right interface. In Java (and I'm just taking it as an example), you would have to build explicit interface hierarchies, which easily results in elaborate class hierarchies containing literally hundreds of classes. In a dynamically typed language, you can keep much of this stuff implicit, which has the huge benefit that it requires so much less boilerplate to actually use the framework, write new methods and have the framework interact with your code.
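As a small illustration of the "right interface is enough" idea, here is a sketch with two unrelated classes that share no base class or declared interface, yet both work with the same evaluation function. The class and function names (`MajorityVote`, `Threshold`, `error_rate`) and the tiny data set are invented for this example.

```python
class MajorityVote:
    """Baseline: always predict the most frequent training label."""

    def train(self, X, y):
        self.label = max(set(y), key=y.count)
        return self

    def predict(self, X):
        return [self.label for _ in X]


class Threshold:
    """Predict 1 if the input exceeds a fixed cut-off, otherwise 0."""

    def __init__(self, cut=0.0):
        self.cut = cut

    def train(self, X, y):
        return self  # nothing to fit in this toy example

    def predict(self, X):
        return [1 if x > self.cut else 0 for x in X]


def error_rate(method, X_train, y_train, X_test, y_test):
    """Accepts anything that has train() and predict(); no interface declaration needed."""
    pred = method.train(X_train, y_train).predict(X_test)
    return sum(p != t for p, t in zip(pred, y_test)) / len(y_test)


# Made-up toy data.
X_train, y_train = [-2.0, -1.0, -0.5, 1.0, 2.0], [0, 0, 0, 1, 1]
X_test, y_test = [-1.5, 0.5, 1.5, 3.0], [0, 1, 1, 1]

rates = {
    type(m).__name__: error_rate(m, X_train, y_train, X_test, y_test)
    for m in (MajorityVote(), Threshold(cut=0.0))
}
```

The price of keeping the interface implicit is that a missing `predict()` only shows up at run time, which is the usual trade-off against the Java-style approach.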
One last point before I take a look at the current state. An important consequence is that method-related functions like cross-validation, or other kinds of evaluation procedures, should return an ordinary data set: not a special type of object which contains the results, but really the same kind of data structure you use to analyze your usual data. These two steps, treating methods as objects and storing the result of a method's assessment again in a data set, truly close the loop, and will turn a data analysis toolbox into a machine learning research toolbox.
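Here is a rough sketch of what "results as an ordinary data set" could mean, assuming that a data set is simply a list of records. The `assess` routine, the `column_mean` helper and all numbers are hypothetical. In particular, the per-fold error functions are placeholders for actually training and testing a method on fold `k`.

```python
from statistics import mean


def column_mean(rows, column, **where):
    """Average one column over all rows matching the given column=value filters."""
    selected = [r[column] for r in rows if all(r[k] == v for k, v in where.items())]
    return mean(selected)


# "Usual" raw data: a plain list of records (made-up values).
measurements = [
    {"sensor": "a", "value": 1.0},
    {"sensor": "a", "value": 3.0},
    {"sensor": "b", "value": 10.0},
]


def assess(methods, folds):
    """Hypothetical assessment routine returning records of the same kind as raw data."""
    rows = []
    for name, fold_error in methods.items():
        for k in range(folds):
            # fold_error(k) stands in for a real train/test run on fold k.
            rows.append({"method": name, "fold": k, "error": fold_error(k)})
    return rows


# Placeholder error functions with made-up values.
results = assess(
    {"baseline": lambda k: 0.5, "new": lambda k: 0.25 if k % 2 == 0 else 0.5},
    folds=4,
)

# The very same helper analyzes raw data and assessment results.
sensor_a_mean = column_mean(measurements, "value", sensor="a")
new_method_error = column_mean(results, "error", method="new")
baseline_error = column_mean(results, "error", method="baseline")
```

Because `results` is just another data set, every tool written for raw data (filtering, averaging, plotting, testing) applies to it without any conversion step.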
So what is the current state of affairs? From a quick glance at tutorials and documentation, it seems that most toolboxes are still in the stage where they try to suck in as many machine learning methods as possible, and provide mechanisms for building elaborate data analysis schemes from them. Which is perfectly okay with me: the whole framework described above would be pretty useless without any data analysis methods.
But there are also signs that people are starting to "unlearn" their matlab training and take advantage of the OOP modelling power. Just to name an example, the PyML toolbox provides generic assessment routines which take a classifier object and perform all kinds of data analysis steps on the method. However, the resulting data is put into a data structure which is actually different from the "normal" data structure. But aside from this minor restriction, this is the direction I'd like to see more of!