Open Thoughts

May 2008 archive

Some thoughts on Machine Learning Toolboxes

May 24, 2008

One popular format for an open source project in machine learning seems to be the creation of a complete toolbox, providing most of what you need for your everyday machine learning work. Often, such projects are not consciously started but just evolve out of the environment one constructs for oneself, which is good, as it ensures that the included features actually work and are relevant.

Often, such toolboxes use one of the newer scripting languages, for example python together with a scientific toolbox like scipy. PyML and Monte Python are two examples which can be found at mloss.org. The alternatives would be either something like matlab, which already contains an enormous amount of support for numerical computation, or a compiled language like Java, in which you can build almost everything.

Actually, I think that using a scripting language like python is a huge step in the right direction. All the raw computing power supplied by matlab aside, the programming language used in matlab is already a bit rusty. Well, I know that matlab provides some support for object-oriented programming, but the one-file-per-function rule really breaks down when you start assembling objects. Pass-everything-by-value is also quite a headache. (Having just visited their website, it seems that they have cleaned up their OOP stuff a bit. But since I haven't checked too closely, and for the sake of the argument, let's just pretend they haven't :).)

So, yes, scripting languages are great because we finally have much more powerful tools for modelling the computational processes we work with. An often overlooked fact (from the viewpoint of a toolbox designer) is that machine learning is not just about analyzing data, but also about developing new methods. Interestingly, we apply the same statistical methodology to evaluate methods that we use to analyze raw data: for example, we assess methods on resamples of the data when using cross-validation, and we apply statistical tests to see whether a method performs significantly better than the state of the art.
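
To make this concrete, here is a minimal sketch of that last step, using scipy (which the toolboxes above build on). The per-fold accuracies are invented for illustration, not measured results:

    # Hypothetical per-fold accuracies from 10-fold cross-validation;
    # the numbers are made up purely for illustration.
    from scipy import stats

    scores_new = [0.81, 0.79, 0.84, 0.80, 0.82, 0.78, 0.83, 0.80, 0.81, 0.79]
    scores_old = [0.77, 0.75, 0.80, 0.76, 0.79, 0.74, 0.78, 0.77, 0.76, 0.75]

    # Paired test, since both methods were evaluated on the same folds.
    t, p = stats.ttest_rel(scores_new, scores_old)
    print("t = %.3f, p = %.4f" % (t, p))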

In other words, machine learning research actually closes the loop between data and analyzer in the sense that the methods which we use to analyze data become the object of study themselves (and therefore also the target of statistical analysis). What this means for the underlying programming language is that it must have the capacity to treat methods as objects themselves.

The programming language beneath matlab basically goes as far as allowing function handles (if we forget the OOP part which has been bolted onto the language), but machine learning methods have much more structure than a function which takes some arguments. For example, most methods have additional parameters which have to be tuned to achieve good performance. Only once you can talk about a method and its parameters in a natural fashion can you start to write something like a truly generic cross-validation method, or a function which takes a bunch of methods and a data set and computes the table of numbers which allows us to compare the methods (and write papers).
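
As a sketch of what I mean (this is hypothetical code, not the API of any existing toolbox), a single generic routine can cross-validate anything that knows how to train and predict:

    # A hypothetical generic cross-validation routine: it works with any
    # object providing train() and predict(), whatever the method is.
    def cross_validate(method, points, labels, folds=5):
        """Return one accuracy per fold."""
        scores = []
        for f in range(folds):
            test = set(range(f, len(points), folds))   # every folds-th point
            train = [i for i in range(len(points)) if i not in test]
            method.train([points[i] for i in train],
                         [labels[i] for i in train])
            hits = sum(method.predict(points[i]) == labels[i] for i in test)
            scores.append(hits / float(len(test)))
        return scores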

All of this is simple in an object-oriented language like python. We can implement methods as objects, not just as a collection of train/predict functions, provide methods for querying all the interesting additional information, and then write methods which work with other methods. (Okay, I admit that this might also be possible in matlab. I have had ideas about how to parse the initial comment in a matlab script file to extract this kind of information, but let's just not go there...)
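
Continuing the sketch from above (again hypothetical code, not PyML or any other toolbox's API), a method carries its tunable parameters as ordinary, queryable data, and higher-order routines simply take methods as arguments:

    # A method is an object: its tunable parameters are plain data that
    # generic procedures can query and set.
    class NearestNeighbour:
        def __init__(self, k=1):
            self.params = {"k": k}

        def train(self, points, labels):
            self.points, self.labels = points, labels

        def predict(self, x):
            # label of the closest training point (k ignored for brevity)
            dists = [sum((a - b) ** 2 for a, b in zip(x, p))
                     for p in self.points]
            return self.labels[dists.index(min(dists))]

    # A method that works with other methods: the table of numbers that
    # lets us compare them (and write papers).
    def compare(methods, points, labels):
        return [(m.__class__.__name__, m.params,
                 cross_validate(m, points, labels))
                for m in methods]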

I personally think scripting languages are also better suited to building this kind of abstract framework than statically typed languages like Java. The reason is that the flexibility of the type system (or a complete lack thereof) allows us to build frameworks which work with all kinds of objects, as long as they provide the right interface. In Java (and I'm just taking it as an example), you would have to build explicit interface hierarchies, which easily results in elaborate class hierarchies containing literally hundreds of classes. In a loosely typed language, you can keep much of this stuff implicit, which has the huge benefit that it requires so much less boilerplate to actually use the framework, write new methods, and have the framework interact with your code.
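
In code, the contrast is simply that nothing has to be declared (hypothetical again): a class that happens to provide train() and predict() plugs straight into the cross_validate() and compare() sketches above, without inheriting from anything or implementing a named interface:

    # No base class, no interface declaration: providing the right
    # methods is all the framework above requires.
    class MajorityVote:
        params = {}                        # nothing to tune

        def train(self, points, labels):
            self.guess = max(set(labels), key=labels.count)

        def predict(self, x):
            return self.guess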

One last point before I take a look at the current state: an important consequence is that method-related functions like cross-validation, and other kinds of evaluation procedures, should return an ordinary data set, not some special object which contains the results, but really the same kind of data structure you use to analyze your usual data. These two steps (methods as objects, and storing the result of a method's assessment again in a data set) truly close the loop, and will turn a data analysis toolbox into a machine learning research toolbox.
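
As a sketch of that last step (hypothetical code once more), the assessment comes back as the same kind of tabular data we analyze anyway, so the statistical machinery from above, the t-test included, applies to it directly:

    # The assessment itself is returned as an ordinary data set:
    # one (method, fold, score) row per cross-validation fold.
    def assess(methods, points, labels):
        rows = []
        for m in methods:
            for fold, score in enumerate(cross_validate(m, points, labels)):
                rows.append((m.__class__.__name__, fold, score))
        return rows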

So what is the current state of affairs? From a quick glance at tutorials and documentation, it seems that most toolboxes are still in the stage where they try to suck in as many machine learning methods as possible, and provide mechanisms for building elaborate data analysis schemes from them. Which is perfectly okay with me; the whole framework described above would be pretty useless without any data analysis methods.

But there are also signs that people are starting to "unlearn" their matlab training and take advantage of the OOP modelling power. Just to name an example, the PyML toolbox provides generic assessment routines which take a classifier object and perform all kinds of data analysis steps on the method! However, the resulting data is put into a data structure which is actually different from the "normal" data structure. But aside from this minor restriction, this is the direction I'd like to see more of!

Proposal for NIPS*08 workshop

May 22, 2008

We are planning to have a NIPS workshop again this year. After last year's workshop proposal was not accepted, we thought it would be a good idea to discuss our current proposal publicly. We very much welcome feedback, and we are also looking for active co-organizers!

Some thoughts on Open Data

May 15, 2008

At the end of last year, Science Commons announced the Protocol for Implementing Open Access Data, which concerns the interoperability of scientific data. bbgm has summarized this in 10 points, of which I would like to focus on the first. Quoting bbgm:

Given the amount of legacy data, it is unlikely that a single license will work for scientific data. Therefore, the memo focuses on principles for open access data and a protocol for implementing those principles.

Is licensing appropriate for scientific data?

The first knee-jerk reaction is to say "Of course! It will protect different people's interests." However, as pointed out by John Wilbanks, data available in the public domain cannot be made "more free" by licensing, only less. Quoting him:

The public domain is not an “unlicensed commons”. The public domain does not equal the BSD. It is not a licensing option. It is the natural legal state of data.

There are several other opinions here and here, but at the end of the day, it is clear that open data is highly important for scientific research, possibly even more important than open source. My personal view is that for machine learning, the public domain seems to be the best choice for our data.

Taking this idea of "public domain" to the area of software, one can ask whether all academic software should be open source. I had the pleasure of spending a few days last week talking to Neil Lawrence and Carl Rasmussen. Neil seems to have software for each paper that he has recently submitted available on a group webpage. Carl is one of the many people who have contributed to the Gaussian Processes website. The listed projects would be considered (I guess) public domain, or "freely available for academic use". Does it matter that these really useful pieces of software do not have explicit licensing? Should they consider some form of license?

LWPR is the first application to make it into JMLR-MLOSS

May 9, 2008

The Library for Locally Weighted Projection Regression, or LWPR for short, has been accepted at JMLR. We would like to thank the authors for their effort, and we will start to interlink and highlight accepted JMLR submissions.

MLOSS progress updates for May 2008

May 8, 2008

Here is a bit of self-advertising, and a development in the bioinformatics community...

We have, as of today, 68 software projects and 205 registered users on the site (http://mloss.org). What surprised me is the breadth of languages that machine learners seem to write their software in. A look at the list of languages revealed that most of the popular languages are represented in our list of mloss projects:

  • C, C++
  • CLISP, Java
  • Matlab, Octave
  • Python, Perl
  • R, Ruby

Comparing with the most popular programming languages on the TIOBE index, notable languages that are missing include:

  • Visual Basic, PHP
  • C#, D, Delphi
  • JavaScript

One can argue that many of these languages are more suited to web development than to machine learning code, but C# and Delphi are general-purpose languages. Maybe the fact that they are strongly linked with Microsoft has scared open source developers away from those languages.

In a discussion post, I pointed out that the International Society for Computational Biology was finalizing a policy statement about software sharing, and that they recommend open source software. The relevant section says:

III. Implementation when software sharing is warranted

  1. In most cases, it is preferable to make source code available. We recommend executable versions of the software should be made available for research use to individuals at academic institutions.
  2. Open source licenses are one effective way to share software.
    For more information, see the definition of open source, and example licenses, at www.opensource.org.

For the bioinformatics community, this means that researchers can more easily justify to the powers that be that open source is the right way to go. Will the machine learning community follow?