Open Thoughts

On our NIPS workshop

Posted by Mikio Braun on December 19, 2008

On December 12, our third workshop on machine learning open source software was held in Whistler, BC, Canada. It featured two invited speakers, a host of new and exciting software projects on machine learning, and two interesting discussions where we tried to initiate new developments.

We were very glad to get two speakers from prestigious and successful projects: John W. Eaton from octave, a matlab clone, and John D. Hunter from matplotlib, a plotting library for python with a matlab-like interface.

John W. Eaton gave valuable insights into his experience of running an open source project. Octave was started in 1989 as companion software to a book on chemical reactions, and its main purpose was to give students something more accessible than Fortran. Only afterwards did people realize that octave was very close to matlab, and over the years they requested better and better compatibility with matlab. The last major release, version 3, brought even fuller compatibility with support for sparse matrices and a complete overhaul of the plotting functionality. Still, octave is looking for help, in particular in the areas of documentation, mailing list maintenance, and packaging. So if you're interested, drop John a line.

Matplotlib by John D. Hunter also started as a private project. John worked on epilepsy research in neurophysiology and initially wrote what would become matplotlib to display brain waves together with related data. At some point, matplotlib became so big that it practically required all of his time. John now works in the finance industry but has an agreement with his employer to spend a certain fraction of his time on matplotlib. We also learned that matplotlib contains a full re-implementation of Donald Knuth's TeX algorithms for rendering annotations in the plot.

Both speakers stressed the importance of being resilient and pointed out that it took them some time (possibly even years) before their projects really took off. Both also shared their insights on how difficult it can be to deal with users. On the one hand, you have to be reliable to build up trust in your project. On the other hand, there are always some users who expect full support basically for free and are unwilling to contribute.

Besides those two invited talks, we again had a number of interesting projects. The submissions this year could roughly be classified into full frameworks, projects which focus on a special type of application or algorithm, and infrastructure.

We had several projects which provide a full-blown environment for doing machine learning and statistical data analysis. The first talk was on Torch, a full-blown matlab replacement written in a combination of Lua and C++. Torch is optimized for efficiency and large-scale learning and comes with its own matrix classes (called tensors) and plotting routines. Shark is a similarly feature-rich framework written in C++. For users of R, there is kernlab, which focuses on kernel methods. Finally, python was represented by mlpy and by mdp, which sported an innovative module architecture that allows users to plug data processing modules together. It was very interesting to see that there exist so many different projects with such a broad scope. It was also quite interesting to learn that these projects weren't very aware of one another.

Projects with a narrower focus included Nieme, which contains algorithms for energy-based learning, libDAI, a library of inference algorithms for graphical models with discrete state spaces, and Model Monitor, a tool for assessing the amount of distribution shift in the data and the sensitivity of algorithms under distribution shift. The BCPy project provides a python layer over the BCI2000 system and allows users to work with the latter in a much more flexible manner.

Finally, we had projects which dealt with different aspects of infrastructure. The RL Glue project provides a general framework to connect environments and learners in a reinforcement learning setting. This project has been highly successful, and is the standard platform for a number of challenges in this area. Disco implements the map-reduce framework for distributed computing in a particularly elegant manner for python users, based on a core written in Erlang. The Experiment Databases for Machine Learning and BenchMarking Via Weka projects address the issue of benchmarking machine learning algorithms in an automatic and reproducible way and provide a database to describe models and experimental results.

In summary, it seems that researchers are quite active in providing feature-rich, high-quality open source software for machine learning. The 23 submissions to this workshop, a large number, also provide evidence for that. At the same time, it seems that most projects are still oblivious of each other. In particular, when it comes to interoperability, there is still a lot missing, which makes it hard to combine algorithms written in different languages, or code developed for different frameworks.

Therefore, one of the discussions focused on the question of interoperability. As a starting point, we proposed the ARFF file format as a common file format for exchanging data. Such a file format could serve as an important first step. Leaving more complex solutions like remote method invocation or CORBA aside, a common data format is really the simplest way to exchange data between two pieces of code which might be written in different languages or run on different platforms. As we expected, the discussion was quite lively, as the number of possible data formats is large, and the different features you could want are not always compatible. But I think what we achieved was to raise awareness of the need for interoperability. Hopefully, people will start to think about how their code could interact with other code, and standards will emerge over time.
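To make the idea of a shared data file a bit more concrete, here is a minimal sketch in python of writing and reading a small ARFF file using only the standard library. It is an illustration under simplifying assumptions, not a complete implementation of the format: it handles only numeric and nominal attributes and ignores quoting, missing values, and string, date, and sparse data. The function names `write_arff` and `read_arff` and the tiny example data set are made up for this sketch.

```python
import io


def write_arff(relation, attributes, rows, out):
    """Write rows to `out` in a minimal ARFF layout.

    `attributes` is a list of (name, kind) pairs, where kind is either
    the string "numeric" or a list of nominal labels.
    """
    out.write(f"@relation {relation}\n\n")
    for name, kind in attributes:
        if isinstance(kind, list):
            kind = "{" + ",".join(kind) + "}"
        out.write(f"@attribute {name} {kind}\n")
    out.write("\n@data\n")
    for row in rows:
        out.write(",".join(str(value) for value in row) + "\n")


def read_arff(src):
    """Parse the simplified ARFF subset produced by write_arff."""
    relation, attributes, rows = None, [], []
    in_data = False
    for line in src:
        line = line.strip()
        if not line or line.startswith("%"):  # skip blanks and comments
            continue
        if in_data:
            values = line.split(",")
            rows.append([
                float(value) if kind == "numeric" else value
                for (_, kind), value in zip(attributes, values)
            ])
            continue
        keyword, _, rest = line.partition(" ")
        keyword = keyword.lower()
        if keyword == "@relation":
            relation = rest.strip()
        elif keyword == "@attribute":
            name, _, kind = rest.strip().partition(" ")
            kind = kind.strip()
            if kind.startswith("{"):
                # Nominal attribute: a brace-enclosed list of labels.
                kind = [label.strip() for label in kind.strip("{}").split(",")]
            else:
                kind = kind.lower()
            attributes.append((name, kind))
        elif keyword == "@data":
            in_data = True
    return relation, attributes, rows


# Hypothetical example data: one numeric and one nominal attribute.
attributes = [("sepal_length", "numeric"), ("species", ["setosa", "versicolor"])]
rows = [[5.1, "setosa"], [7.0, "versicolor"]]

buffer = io.StringIO()
write_arff("iris_subset", attributes, rows, buffer)
buffer.seek(0)
relation, parsed_attributes, parsed_rows = read_arff(buffer)
```

Because the file is plain text with a small header, a reader of this kind is easy to write in almost any language, which is a large part of the appeal of a common file format over heavier mechanisms.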

The other discussion addressed an even more difficult question, namely that of reproducibility. How can we ensure that somebody else can reproduce the experimental results from a machine learning paper? An interesting suggestion was to require that the software producing the results be provided on a bootable live CD, like an Ubuntu install CD, to really make sure that the environment in which the experiments were done can be set up easily. Another question was whether you want to be able to reproduce the results only at publication time, or even after ten years. Again, there is the problem of how to describe and store results in a database. Here, too, we did not arrive at a conclusion, but hopefully we raised overall awareness of the issue.

Overall, I think the workshop was very successful and interesting. There is always room for improvement, of course. For example, we should make sure not to forget to schedule coffee breaks next time. Also, I think we should put more emphasis on the community building aspect and less on individual projects. In 2006, the topic was so new that people didn't know what kinds of projects were out there, but now, also due to this website, the existence of open source software for machine learning is much better known. So giving projects a platform to advertise their software is certainly an important part, but thinking about what the next step is and talking about how to integrate what we already have is something I would put more emphasis on next time.

Again I (and Soeren and Cheng as well) would like to thank everybody who contributed to this workshop, and of course also the Pascal2 framework for their financial support.

