Open Thoughts

MLOSS workshop at ICML 2015: Open Ecosystems

Posted by Antti Honkela on March 30, 2015

MLOSS workshops are returning to ICML this summer!

Key dates:

  • Submission DL 28 April 2015
  • Workshop date 10 July 2015

We (Gaël Varoquaux, Cheng Soon Ong and Antti Honkela) are organising another MLOSS workshop at ICML 2015 in Lille, France this July. The theme for this edition is "Open Ecosystems", with which we hope to encourage discussion of the benefits (and drawbacks) of having multiple tools in the same ecosystem. Our invited speakers (John Myles White and Matthew Rocklin) will share some of their experiences with Julia and Python, and we would be happy to hear from others, on the same or different ecosystems, through contributed talks. As usual, demonstrations of great new software are naturally also welcome!

In addition to the talks, we have planned two more active sessions:

  • an open discussion with themes voted by workshop participants similar to MLOSS 2013; and
  • a hackathon for planning and starting to develop infrastructure for measuring software impact.

If you have any comments or suggestions regarding these, please add a comment here or email the organisers!

More details are available at the workshop website.

A third of the top 100 papers are about software

Posted by Cheng Soon Ong on October 30, 2014

How many of the papers in the top 100 most cited are about software?

21, with an additional 12 papers which are not specifically about software itself, but about methods or statistics that were implemented later in software. When you take a step back and think about the myriad areas of research and the stratospheric numbers of citations the top 100 get, it is quite remarkable that one fifth of the papers are actually about software. I mean really about software, not software as an afterthought. Some examples:

To put in perspective how rarefied the air is in the top 100 citations, if we combined all citations received by all JMLR papers in the last five years (according to SCImago) into one gigantic paper, it would still not make it into the top 100.

Yes, yes, citations do not directly measure the quality of the paper, and there are size of community effects and all that. To be frank, being highly cited seems to be mostly luck.

In the spirit of open science, here is a bar plot showing these numbers, and here is my annotated table which I updated from the original table. For a more mainstream view of the data, look at the Nature article.

Open Machine Learning Workshop

Posted by Cheng Soon Ong on July 28, 2014

Just in case there are people who follow this blog but not John Langford's, there is going to be an open machine learning workshop on 22 August 2014 at MSR New York, organised by Alekh Agarwal, Alina Beygelzimer, and John Langford.

As it says on John's blog: If you are interested, please email msrnycrsvp at and say “I want to come” so we can get a count of attendees for refreshments.

Machine Learning Distro

Posted by Cheng Soon Ong on July 22, 2014

What would you include in a Linux distribution to customise it for machine learning researchers and developers? Which tools would cover the needs of 90% of PhD students aiming to do a PhD related to machine learning? How would you customise a mainstream Linux distribution so that, by default, it includes packages that let the user quickly do machine learning on their laptop?

There are several communities which have their own custom distribution:

  • Scientific Linux, which is based on Red Hat Enterprise Linux, is focused on making life easy for system administrators of larger organisations. The two big users are Fermilab and CERN, each of which has its own custom "spin". Because of its experimental physics roots, it does not have a large collection of pre-installed scientific software, but it makes it easy for users to install their own.
  • Bio-Linux is at the other end of the spectrum. Based on Ubuntu, it aims to provide an easy-to-use bioinformatics workstation by including more than 500 bioinformatics programs, along with graphical menus for them and sample data for testing them. It is targeted at the end user, with simple instructions for running it live from DVD or USB, installing it, or dual booting it.
  • Fedora Scientific is the latest entrant, providing a nice list of numerical tools, visualisation packages and also LaTeX packages. Its documentation lists packages for C, C++, Octave, Python, R and Java. Version control is also not forgotten. A recent summary of Fedora Scientific was written as part of Open Source Week.

It would seem that Fedora Scientific would satisfy the majority of machine learning researchers, since it provides packages for most things already. Some additional tools that may be useful include:

  • tools for managing experiments and collecting results, to make our papers replicable
  • GPU packages for CUDA and OpenCL
  • Something for managing papers for reading, similar to Mendeley
  • Something for keeping track of ideas and to do lists, similar to Evernote

There's definitely tons of stuff that I've forgotten!

Perhaps a good way to start is to have the list of package names useful for the machine learning researcher in some popular package managers such as yum, apt-get, dpkg. Please post your favourite packages in the comments.

Google Summer of Code 2014

Posted by Cheng Soon Ong on June 3, 2014

GSoC 2014 is between 19 May and 18 August this year. The students should now be just sinking their teeth into the code, and hopefully having a lot of fun while gaining invaluable experience. This amazing program is in its 10th year now, and it is worth repeating how it benefits everyone:

  • students - You learn how to write code in a team, and work on projects that are long term. Suddenly, all the software engineering lectures make sense! Having GSoC in your CV really differentiates you from all the other job candidates out there. Best of all, you actually have something to show your future employer that cannot be made up.

  • mentors - You get help for your favourite feature in a project that you care about. For many, it is a good introduction to project management and supervision.

  • organisation - You recruit new users and, if you are lucky, new core contributors. GSoC experience also tends to push projects to be more beginner friendly, and to make it easier for new developers to get involved.

I was curious about how many machine learning projects were in GSoC this year and wrote a small ipython notebook to try to find out.

Looking at the organisations with the most students, I noticed that the Technical University Vienna has come together and joined as a mentoring organisation. This is an interesting development, as it allows different smaller projects (the titles seem disparate) to come together and benefit from a more sustainable open source project.

On to machine learning... Using a bunch of heuristics, I tried to identify machine learning projects from the organisation name and project titles. I found more than 20 projects with variations of "learn" in them. This obviously misses out projects from R some of which are clearly machine learning related, but I could not find a rule to capture them. I am pretty sure I am missing others too. I played around with some topic modelling, but this is hampered by the fact that I could not figure out a way to scrape the project descriptions from the dynamically generated list of project titles on the GSoC page.
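
To illustrate the kind of heuristic used, here is a minimal Python sketch; the sample project list and the looks_like_ml helper are hypothetical stand-ins for the data scraped from the GSoC page:

```python
import re

# Hypothetical sample of (organisation, project title) pairs; the real data
# was scraped from the GSoC site.
projects = [
    ("scikit-learn", "Improved linear models"),
    ("R project", "Bayesian network structure search"),
    ("Debian", "Package build tooling"),
    ("Shogun", "Deep learning module"),
]

# Flag a project as machine-learning related if a variation of "learn"
# appears in the organisation name or title; a deliberately crude heuristic
# that, as noted above, misses clearly ML-related projects such as those in R.
pattern = re.compile(r"learn", re.IGNORECASE)

def looks_like_ml(org, title):
    return bool(pattern.search(org) or pattern.search(title))

ml_projects = [(o, t) for o, t in projects if looks_like_ml(o, t)]
print(len(ml_projects))
```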

Please update the source with your suggestions!

Reproducibility is not simple

Posted by Cheng Soon Ong on March 30, 2014

There has been a flurry of articles recently outlining 10 simple rules for X, where X has something to do with data science, computational research and reproducibility. Some examples are:

Best practices

These articles provide a great resource to get started on the long road to doing "proper science". Some common suggestions which are relevant to practical machine learning include:

Use version control

Start now. No, not after your next paper, do it right away! Learn one of the modern distributed version control systems (git and mercurial are currently the most popular) and get an account on GitHub or Bitbucket to start sharing. Even if you don't share your code, it is a convenient offsite backup. GitHub is the most popular for open source projects, but Bitbucket has the advantage of free private accounts. If you have an email address from an educational institution, you get the premium features for free too.

Distributed version control systems can be conceptually daunting, but it is well worth the trouble to understand the concepts instead of just robotically typing in commands. There are numerous tutorials out there; some which I personally found entertaining are git foundations and hginit. For those who don't like the command line, have a look at GUIs such as sourcetree, tortoisegit, tortoisehg, and gitk. If you work with other people, it is worth learning the fork and pull request model, and using the gitflow convention.

Please add your favourite tips and tricks in the comments below!

Open source your code and scripts

Publish everything. Even the two lines of Matlab that you used to plot your results. The readers of your NIPS and ICML papers are technical people, and it is often much simpler for them to look at your Matlab plot command than to parse the paragraph that describes the x and y axes, the meaning of the colours and line types, and the specifics of the displayed error bars. Tools such as IPython notebooks and knitr are examples of easy-to-use literate programming frameworks that let you make your supplement a live document.
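
To make the point concrete, here is what such a published plotting script might look like, sketched in Python with matplotlib rather than Matlab; the data and labels are made up:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt

# Made-up results: mean test error per training-set size, with std errors.
sizes = np.array([100, 200, 400, 800])
mean_err = np.array([0.31, 0.24, 0.19, 0.16])
std_err = np.array([0.04, 0.03, 0.02, 0.02])

# The whole figure is these few lines; shipping them with the paper is
# clearer than describing the axes and error bars in prose.
plt.errorbar(sizes, mean_err, yerr=std_err, marker="o")
plt.xlabel("training set size")
plt.ylabel("mean test error")
plt.savefig("learning_curve.png")
```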

It is often useful to try to conceptually split your computational code into "programs" and "scripts". There is no hard and fast rule for where to draw the line, but one useful way to think about it is to contrast code that can be reused (something to be installed), and code that runs an experiment (something that describes your protocol). An example of the former is your fancy new low memory logistic regression training and testing code. An example of the latter is code to generate your plots. Make both types of code open, document and test them well.
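
A hypothetical sketch of this split in Python: the train_logistic function plays the role of the reusable "program" (something you would package, document and test), while the lines below it form the experiment "script" (the protocol for one paper). Both the function and the synthetic data are illustrative inventions, not anyone's actual method:

```python
import numpy as np

# --- "program": reusable code, something to be installed ---
def train_logistic(X, y, lr=0.1, n_iter=500):
    """Plain batch gradient-descent logistic regression (illustrative only)."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ w))       # predicted probabilities
        w -= lr * X.T @ (p - y) / len(y)        # gradient step on log-loss
    return w

# --- "script": the experiment protocol, specific to one paper ---
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = (X @ true_w + rng.normal(scale=0.1, size=200) > 0).astype(float)

w = train_logistic(X, y)
accuracy = np.mean(((X @ w) > 0) == y)
print(round(accuracy, 2))
```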

Make your data a resource

Your results are also data. When open data is mentioned, most people immediately conjure images of the inputs to prediction machines, but intermediate stages of your workflow are often left out when making things available. For example, if in addition to providing the two lines of plotting code you also provided the multidimensional array containing your results, your paper becomes a resource for future benchmarking efforts. If you made your precomputed kernel matrices available, other people could easily try out new kernel methods without having to go through the effort of computing the kernel.
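
A small sketch of the idea using NumPy; the arrays and the filename are of course made up:

```python
import numpy as np

# Hypothetical results: error rates for 3 methods across 5 CV folds.
results = np.array([
    [0.12, 0.11, 0.13, 0.12, 0.10],
    [0.09, 0.10, 0.08, 0.11, 0.09],
    [0.15, 0.14, 0.16, 0.15, 0.14],
])

# A precomputed linear kernel matrix for some feature matrix X.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))
K = X @ X.T

# Ship these arrays alongside the paper: one .npz file is enough for others
# to re-plot, re-test, or benchmark against without re-running anything.
np.savez("paper_artifacts.npz", results=results, kernel=K)

loaded = np.load("paper_artifacts.npz")
print(loaded["results"].shape, loaded["kernel"].shape)
```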

Several community efforts provide useful resources for hosting machine learning oriented datasets. If you do create a dataset, it is useful to get an identifier for it so that people can give you credit.

Challenges to open science

While the articles call these rules "simple", they are by no means easy to implement. Though easy to state, there are many practical hurdles to making every step of your research reproducible.

Social coding

Unlike publishing a paper, where you do all your work before publication, publishing a piece of software often means that you have to support it in future. It is remarkably difficult to keep software available in the long term, since most junior researchers move around a lot and often leave academia altogether. It is also challenging to find contributors that can help out in stressful periods, and to keep software up to date and useful. Open source software suffers from the tragedy of the commons, and it quickly becomes difficult to maintain.

While it is generally good for science that everything is open and mistakes are found and corrected, the current incentive structure in academia does not reward support for ongoing projects. Funding is focused on novel ideas, publications are used as metrics for promotion and tenure, and software gets left out.

The secret branch

When developing a new idea, it is often tempting to do so without making it open to public scrutiny. This is similar to the idea of a development branch, but you may wish to keep it secret until publication. The same argument applies for data and results, where there may be a moratorium. I am currently unaware of any tools that allow easy conversion between public and private branches. Github allows forks of repositories, which you may be able to make private.

Once researchers get fully involved in an application area, it is inevitable that they start working on the latest data generated by their collaborators. This could be the real-time stream from Twitter or the latest double-blind drug study. Such datasets are often embargoed from being made publicly available due to concerns about privacy. In the area of biomedical research there are efforts, such as dbGaP, to allow bona fide researchers access to data; it seamlessly provides a resource for both public and private data. Instead of being a hurdle, a convenient mechanism to facilitate the transition from private to open science would encourage many new participants.

What is the right access control model for open science?

Data is valuable

It is a natural human tendency to protect scarce resources that give one a competitive advantage. For researchers, these resources include source code and data. While it is understandable that authors of software or architects of datasets would like to be the first to benefit from their investment, it often happens that these resources are not made publicly available even after publication.

Keynotes at ACML 2013

Posted by Cheng Soon Ong on November 14, 2013

We were very lucky this year to have an amazing set of keynote speakers at ACML 2013 who have made key contributions to getting machine learning into the real world. Here are some links to the open source software projects that they mentioned during their talks. The videos of the talks should be available at some point on the ACML website.

We started off with Geoff Holmes, who spoke at MLOSS 06. He told us about how WEKA has been used in industry (satisfying Kiri Wagstaff's Challenge #2), and the new project for streaming data MOA. Later in the day, Chih-Jen Lin told us how important it was to understand both machine learning and optimisation, such that you can exploit the special structure for fast training of SVMs. This is how he obtained amazing speedups in LIBLINEAR. On the second day, Ralf Herbrich (who also gave a tutorial) gave us a behind the scenes tour of TrueSkill, the player matching algorithm used on XBox Live. Source code in F# is available here and the version generalised to track skill over time is available here.

Thanks to Geoff, Chih-Jen and Ralf for sharing their enthusiasm!

What does the “OSS” in MLOSS mean?

Posted by Mark Reid on September 1, 2013

I was recently asked to become an Action Editor for the Machine Learning and Open Source Software (MLOSS) track of Journal of Machine Learning Research. Of course, I gladly accepted since the aim of the JMLR MLOSS track (as well as the broader MLOSS project) -- to encourage the creation and use of open source software within machine learning -- is well aligned with my own interests and attitude towards scientific software.

Shortly after I joined, one of the other editors raised a question about how we are to interpret an item in the review criteria that states that reviewers should consider the "freedom of the code (lack of dependence on proprietary software)" when assessing submissions. What followed was an engaging email discussion amongst the Action Editors about how to clarify our position.

After some discussion (summarised below), we settled on the following guideline which tries to ensure MLOSS projects are as open as possible while recognising the fact that MATLAB, although "closed", is nonetheless widely used within the machine learning community and has an open "work-alike" in the form of GNU Octave:

Dependency on Closed Source Software

We strongly encourage submissions that do not depend on closed source and proprietary software. Exceptions can be made for software that is widely used in a relevant part of the machine learning community and accessible to most active researchers; this should be clearly justified in the submission.

The most common case here is the question of whether we will accept software written for MATLAB. Given its wide use in the community, there is no strict reject policy for MATLAB submissions, but we strongly encourage submissions to strive for compatibility with Octave unless absolutely impossible.

The Discussion

There were a number of interesting arguments raised during the discussion, so I offered to write them up in this post for posterity and to solicit feedback from the machine learning community at large.

Reviewing and decision making

A couple of arguments were put forward in favour of a strict "no proprietary dependencies" policy.

Firstly, allowing proprietary dependencies may limit our ability to find reviewers for submissions, an already difficult job. Secondly, stricter policies have the benefit of being unambiguous, which would avoid future discussions about the acceptability of future submissions.

Promoting open ports

An argument made in favour of accepting projects with proprietary dependencies was that doing so may actually increase the chances of their code being forked to produce a version without such dependencies.

Mikio Braun explored this idea further along with some broader concerns in a blog post about the role of curation and how it potentially limits collaboration.

Where do we draw the line?

Some of us had concerns about what exactly constitutes a proprietary dependency and came up with a number of examples that possibly fall into a grey area.

For example, how do operating systems fit into the picture? What if the software in question only compiles on Windows or OS X? These are both widely used but proprietary. Should we ensure MLOSS projects also work on Linux?

Taking a step up the development chain, what if the code base is most easily built using proprietary development tools such as Visual Studio or XCode? What if libraries such as MATLAB's Statistics Toolbox or Intel's MKL library are needed for performance reasons?

Things get even more subtle when we note that certain data formats (e.g., for medical imaging) are proprietary. Should such software be excluded even though the algorithms might work on other data?

These sorts of considerations suggested that a very strict policy may be difficult to enforce in practice.

What is our focus?

It is pretty clear what position Richard Stallman or other fierce free software advocates would take on the above questions: reject all of them! It is not clear that such an extreme position would necessarily suit the goals of the MLOSS track of JMLR.

Put another way, is the focus of MLOSS the "ML" or the "OSS"? The consensus seemed to be that we want to promote open source software to benefit machine learning, not the other way around.

Looking At The Data

Towards the end of the discussion, I made the argument that if we cannot be coherent we should at least be consistent and presented some data on all the accepted MLOSS submissions. The list below shows the breakdown of languages used by the 50 projects that have been accepted to the JMLR track to date. I'll note that some projects use and/or target multiple languages and that, because I only spent half an hour surveying the projects, I may have inadvertently misrepresented some (if I've done so, let me know).

C++: 15; Java: 13; MATLAB: 11; Octave: 10; Python: 9; C: 5; R: 4.

From this we can see that MATLAB is fairly well-represented amongst the accepted MLOSS projects. I took a closer look and found that of the 11 projects that are written in (or provide bindings for) MATLAB, all but one of them provide support for GNU Octave compatibility as well.

Closing Thoughts

I think the position we've adopted is realistic, consistent, and suitably aspirational. We want to encourage and promote projects that strive for openness and the positive effects it enables (e.g., reproducibility and reuse) but do not want to strictly rule out submissions that require a widely used, proprietary platform such as MATLAB.

Of course, a project like MLOSS is only as strong as the community it serves so we are keen to get feedback about this decision from people who use and create machine learning software so feel free to leave a comment or contact one of us by email.

Note: This is a cross-post from Mark's blog at Inductio ex Machina.

Code review for science

Posted by Cheng Soon Ong on August 14, 2013

How good is the software associated with scientific papers? There seems to be a general impression that the quality of scientific software is not that great. How do we check for software quality? Well, by doing code review.

In an interesting experiment between the Mozilla Science Lab and PLoS Computational Biology, a selected number of papers with snippets of code from the latter will be reviewed by engineers from the former.

For more details see the blog post by Kaitlin Thaney.

GSoC 2013

Posted by Cheng Soon Ong on April 9, 2013

GSoC has just announced the list of participating organisations. This is a great opportunity for students to get involved in projects that matter, and to learn about code development which is bigger than the standard "one semester" programming project that they are usually exposed to at university.

Some statistics:

  • 177 of 417 organisations that applied were accepted, a success rate of 42%.
  • 40 of the 177 accepted organisations are participating for the first time, which is a 23% proportion of new blood.

These seem to be in the same ballpark as most other competitive schemes for obtaining funding. Perhaps there is some type of psychological "mean" which reviewers gravitate to when they are evaluating submissions. For example, consider that out of the 4258 students that applied for projects in 2012, 1212 students got accepted, a rate of 28%.
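
These rates are easy to verify; a quick sanity check on the figures quoted above:

```python
# Figures quoted in the post: organisations in 2013, first-time
# organisations, and student applications in 2012.
orgs_2013 = round(177 / 417 * 100)        # -> 42 (% of applying orgs accepted)
new_orgs = round(40 / 177 * 100)          # -> 23 (% first-time orgs)
students_2012 = round(1212 / 4258 * 100)  # -> 28 (% of student applicants accepted)

print(orgs_2013, new_orgs, students_2012)
```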

To the students out there, please get in touch with potential mentors before putting in your applications. You'd be surprised at how much it could improve your application!

Scientist vs Inventor

Posted by Cheng Soon Ong on March 18, 2013

Mikio and I are writing a book chapter about "Open Science in Machine Learning", which will appear in a collection titled "Implementing Computational Reproducible Research". Among many things, we mentioned that machine learning is about inventing new methods for solving problems. Luis Ibanez from Kitware pounced on this statement, and proceeded to give a wonderful argument that we are confusing our roles as scientists with the pressure of being an inventor. The rest of this post is an exact reproduction of Luis' response to our statement.

“... machine learning is concerned with creating new learning methods to perform well on certain application problems.”.

The authors discuss the purpose of machine learning, but under the untold context of “research on machine learning”, and the current landscape of funding research. To clarify, the authors imply that novelty is the purpose of machine learning research. More explicitly, that “developing new methods” is the goal of research.

This is a view (not limited to machine learning) that is commonly widespread, and that in practice is confirmed by the requirements of publishing and pursuit of grant funding. I beg to differ with this view, in the sense that “novelty” is not part of the scientific process at all. Novelty is an artificial condition that has been imposed on scientific workers over the years, due to the need to evaluate performance for the purpose of managing scarce funding resources. The goal of scientific research is to attempt to understand the world by direct observation, crafting of hypothesis and evaluation of hypothesis via reproducible experiments.

The pursuit of novelty (real or apparent) is actually a distraction, and it is one of the major obstacles to the practice of reproducible research. By definition, repeating an experiment, implies, requires and demands to do something that is not new. This distracted overrating of novelty is one of the reasons why scientific workers, and their institutions have come to consider repeatability of experiments as a “waste of time”, since it takes resources away from doing “new things” that could be published or could lead to new streams of funding. This confusion with “novelty” is also behind the lack of interest in reproducing experiments that have been performed by third parties. Since, such actions are “just repeating” what someone else did, and are not adding anything “new”. All, statements that are detrimental to the true practice of the scientific method.

The confusion is evident when one look at calls for proposals for papers in journal, conferences, or for funding programs. All of them call for “novelty”, none of them (with a handful of exceptions) call for reproducibility. The net effect is that we have confused two very different professions: (a) scientific researcher, with (b) inventor. Scientific researchers should be committed to the application of the scientific method, and in it, there is no requirement for novelty. The main commitment is to craft reproducible experiments, since we are after the truth, not after the new. Inventors on the other hand are in the business of coming up with new devices, and are not committed to understanding the world around us.

Most conference, journals, and even funding agencies have confused their role of supporting the understanding the world around us, and have become surrogates for the Patent Office.

In order to make progress in the pursuit of reproducible research, we need to put “novelty” back in its rightful place of being a nice extra secondary or tertiary feature of scientific research, but not a requirement, nor a driving force at all.

Software Licensing

Posted by Cheng Soon Ong on February 6, 2013

One of the tricky decisions software authors have to make is "What license should I use for my software?" A recent article in PLoS Computational Biology discusses the different possible avenues open to authors. It gives a balanced view of software licensing, carefully describing the various dimensions authors of software should consider before coming to a decision.

It recommends the following guidelines:

  • For the widest possible distribution consider a permissive FOSS license such as the BSD/MIT, Apache, or ECL.
  • To ensure that derivatives also benefit FOSS, choose a copyleft FOSS license like the GPL, LGPL, or MPL.
  • For those on the fence, there are hybrid or multi-licensing approaches which can achieve the benefits of both open source and proprietary software licenses.
  • For protecting the confidentiality of your code, there is the proprietary license.

Naturally being an open source venue, I strongly encourage people to consider the first two options. We also discuss the distinction between FOSS licences in our position paper from 2007.

Chemical compound and drug name recognition task.

Posted by Martin Krallinger on January 2, 2013

CALL FOR PARTICIPATION: CHEMDNER task: Chemical compound and drug name recognition task.

TASK GOAL AND MOTIVATION Machine learning methods have been especially useful for the automatic recognition of entity mentions in text, a crucial step for further natural language processing tasks. This task aims to promote the development of open source software for indexing documents with compounds and recognizing compound mentions in text.

The goal of this task is to promote the implementation of systems that are able to detect mentions in text of chemical compounds and drugs. The recognition of chemical entities is also crucial for other subsequent text processing strategies, such as detection of drug-protein interactions, adverse effects of chemical compounds or the extraction of pathway and metabolic reaction relations. A range of different methods have been explored for the recognition of chemical compound mentions including machine learning based approaches, rule-based systems and different types of dictionary-lookup strategies.

As has been the case in previous BioCreative efforts (resulting in high impact papers in the field), we expect that successful participants will have the opportunity to publish their system descriptions in a journal article.

CHEMDNER DESCRIPTION CHEMDNER is one of the tracks of the BioCreative IV community challenge.

We invite participants to submit results for the CHEMDNER task providing predictions for one or both of the following subtasks:

a) Given a set of documents, return for each of them a ranked list of chemical entities described within each of these documents [Chemical document indexing sub-task]

b) Provide for a given document the start and end indices corresponding to all the chemical entities mentioned in this document [Chemical entity mention recognition sub-task].

For these two subtasks the organizers will release training and test data collections. The task organizers will provide details on the annotation guidelines used, define a list of criteria for relevant chemical compound entity types, and describe the selection of documents for annotation.

REGISTRATION Teams can participate in the CHEMDNER task by registering for track 2 of BioCreative IV. You can register additionally for other tracks too. To register your team go to the following page that provides more detailed instructions:

MAILING LIST AND CONTACT INFORMATION You can post questions related to the CHEMDNER task to the BioCreative mailing list. To register for the BioCreative mailing list, please visit the following page: You can also directly send questions to the organizers through e-mail: mkrallinger@cnio.es

WORKSHOP CHEMDNER is part of the BioCreative evaluation effort. The BioCreative Organizing Committee will host the BioCreative IV Challenge evaluation workshop at NCBI, National Institutes of Health, Bethesda, Maryland, on October 7-9, 2013.

CHEMDNER TASK ORGANIZERS

  • Martin Krallinger, Spanish National Cancer Research Center (CNIO)
  • Obdulia Rabal, University of Navarra, Spain
  • Julen Oyarzabal, University of Navarra, Spain
  • Alfonso Valencia, Spanish National Cancer Research Center (CNIO)

REFERENCES

  • Vazquez, M., Krallinger, M., Leitner, F., & Valencia, A. (2011). Text Mining for Drugs and Chemical Compounds: Methods, Tools and Applications. Molecular Informatics, 30(6-7), 506-519.
  • Corbett, P., Batchelor, C., & Teufel, S. (2007). Annotation of chemical named entities. BioNLP 2007: Biological, translational, and clinical language processing, 57-64.
  • Klinger, R., Kolářik, C., Fluck, J., Hofmann-Apitius, M., & Friedrich, C. M. (2008). Detection of IUPAC and IUPAC-like chemical names. Bioinformatics, 24(13), i268-i276.
  • Hettne, K. M., Stierum, R. H., Schuemie, M. J., Hendriksen, P. J., Schijvenaars, B. J., Mulligen, E. M. V., ... & Kors, J. A. (2009). A dictionary to identify small molecules and drugs in free text. Bioinformatics, 25(22), 2983-2991.
  • Yeh, A., Morgan, A., Colosimo, M., & Hirschman, L. (2005). BioCreAtIvE task 1A: gene mention finding evaluation. BMC Bioinformatics, 6(Suppl 1), S2.
  • Smith, L., Tanabe, L. K., Ando, R. J., Kuo, C. J., Chung, I. F., Hsu, C. N., ... & Wilbur, W. J. (2008). Overview of BioCreative II gene mention recognition. Genome Biology, 9(Suppl 2), S2.

Paper "Ten Simple Rules for the Open Development of Scientific Software" by Prlic and Procter

Posted by Mikio Braun on December 11, 2012

PLOS Computational Biology has an interesting Editorial on 10 rules for open development of scientific software. The ten rules are:

  1. Don't Reinvent the Wheel
  2. Code Well
  3. Be Your Own User
  4. Be Transparent
  5. Be Simple
  6. Don't Be a Perfectionist
  7. Nurture and Grow Your Community
  8. Promote Your Project
  9. Find Sponsors
  10. Science Counts.

The full article can be found here.

Best Practices for Scientific Computing

Posted by Cheng Soon Ong on November 28, 2012

I've been following the progress of Software Carpentry for some years now, and have been very impressed by their message that software is the new telescope, and we should invest time and effort to build up skills to ensure that our software is the best quality possible. Otherwise, how can we be sure that our new discoveries are not due to some instrument error?

They wrote a nice short paper titled "Best Practices for Scientific Computing" that highlights practices that would improve the quality of the software, and hence improve research productivity. Here are the 10 recommendations (along with the sub-recommendations).

1. Write programs for people, not computers.

1.1 a program should not require its readers to hold more than a handful of facts in memory at once

1.2 names should be consistent, distinctive, and meaningful

1.3 code style and formatting should be consistent

1.4 all aspects of software development should be broken down into tasks roughly an hour long

2. Automate repetitive tasks.

2.1 rely on the computer to repeat tasks

2.2 save recent commands in a file for re-use

2.3 use a build tool to automate their scientific workflows

3. Use the computer to record history.

3.1 software tools should be used to track computational work automatically

4. Make incremental changes.

4.1 work in small steps with frequent feedback and course correction

5. Use version control.

5.1 use a version control system

5.2 everything that has been created manually should be put in version control

6. Don’t repeat yourself (or others).

6.1 every piece of data must have a single authoritative representation in the system

6.2 code should be modularized rather than copied and pasted

6.3 re-use code instead of rewriting it

7. Plan for mistakes.

7.1 add assertions to programs to check their operation

7.2 use an off-the-shelf unit testing library

7.3 turn bugs into test cases

7.4 use a symbolic debugger

8. Optimize software only after it works correctly.

8.1 use a profiler to identify bottlenecks

8.2 write code in the highest-level language possible

9. Document the design and purpose of code rather than its mechanics.

9.1 document interfaces and reasons, not implementations

9.2 refactor code instead of explaining how it works

9.3 embed the documentation for a piece of software in that software

10. Conduct code reviews.

10.1 use code review and pair programming when bringing someone new up to speed and when tackling particularly tricky design, coding, and debugging problems

10.2 use an issue tracking tool
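Several of these recommendations compose naturally in a few lines of code. The sketch below illustrates recommendation 7 ("plan for mistakes"): an assertion guards an invariant (7.1), a standard unit-testing library is used (7.2), and a bug report becomes a regression test (7.3). The function and its bug are my own hypothetical example, not taken from the paper.

```python
import unittest

def normalize(values):
    """Scale a list of non-negative numbers so they sum to 1."""
    assert all(v >= 0 for v in values), "expected non-negative inputs"
    total = sum(values)
    # The (hypothetical) old bug: an all-zero list caused a divide by zero.
    assert total > 0, "cannot normalize an all-zero list"
    return [v / total for v in values]

class TestNormalize(unittest.TestCase):
    def test_simple(self):
        self.assertEqual(normalize([1, 1, 2]), [0.25, 0.25, 0.5])

    def test_all_zero_rejected(self):
        # Regression test distilled from the divide-by-zero bug (rule 7.3).
        with self.assertRaises(AssertionError):
            normalize([0, 0])
```

Running `python -m unittest` on such a file then automates the checking (recommendation 2).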

Predict Elections with Twitter

Posted by Cheng Soon Ong on October 12, 2012

In a rather self-deprecating title, "I Wanted to Predict Elections with Twitter and all I got was this Lousy Paper", Daniel Gayo-Avello takes us on a tour of how hard it is to do reproducible research, and how often authors take shortcuts. From the abstract:

"Predicting X from Twitter is a popular fad within the Twitter research subculture. It seems both appealing and relatively easy. Among such kind of studies, electoral prediction is maybe the most attractive, and at this moment there is a growing body of literature on such a topic. This is not only an interesting research problem but, above all, it is extremely difficult. However, most of the authors seem to be more interested in claiming positive results than in providing sound and reproducible methods."

It is an interesting survey of papers that use Twitter data.

He lists some flaws in current research on electoral predictions, but they are generally applicable to any machine learning paper (my comments in brackets):

  1. It's not prediction at all! I have not found a single paper predicting a future result. (Neither bootstrap nor cross-validation is prediction.)
  2. Chance is not a valid baseline...
  3. There is not a commonly accepted way of "counting votes" in Twitter
  4. There is not a commonly accepted way of interpreting reality! (In supervised learning, we tend to ignore the fact that there is no ground truth in reality.)
  5. Sentiment analysis is applied as a black-box... (As machine learning algorithms get more complex, more people will tend to use machine learning software as a black box.)
  6. All the tweets are assumed to be trustworthy. (I don't know if anybody is doing adversarial election prediction)
  7. Demographics are neglected. (The biased sample problem)
  8. Self-selection bias.

The window is closing on those who want to predict the upcoming US elections from X.

John Hunter - the author of matplotlib - has died.

Posted by Soeren Sonnenburg on August 30, 2012

John Hunter, the main author of matplotlib, has died of cancer. For those interested, his close friend Fernando gives a few more details here. John was a long-term developer of matplotlib (continuing even while he was working in industry) and a father of three kids. You might consider donating to the John Hunter Memorial Fund.

We had John as an invited speaker at one of our NIPS machine learning open source software workshops. He gave quite an entertaining talk featuring a live demo. I recall that he started with a command prompt, typing everything (including fetching some live stock-exchange data) in Python at insane speed. Videolectures recorded his lecture. I don't know about others, but I have plotted basically all my scientific results using matplotlib and Python for the last several years.

Rest in peace John - your contributions will be remembered.

Machine Learning already matters

Posted by Cheng Soon Ong on June 20, 2012

"Much of machine learning (ML) research has lost its connection to problems of import to the larger world of science and society." So begins Kiri Wagstaff's position paper, which will have a special plenary session on June 29 at ICML 2012. The paper goes on to lament the poor state of affairs in machine learning research. It is an interesting read, and it addresses an important question that any adolescent field faces: "How do I justify my existence?"

I'd like to take the glass-half-full view: machine learning already matters!

Kiri herself uses examples that show that machine learning already has impact. In her introduction, she mentions the CALO project, which forms the basis of Siri on the iPhone 4S, which has revolutionised the way the general public perceives human computer interactions. She also mentions spam detection, which Gmail has generalized to sorting all email with Priority Inbox.

A quick look around the web reveals other success stories:

  • The recent technology quarterly section of the Economist 2 June 2012 edition discusses the use of robots and how we would need to start legislating them. Ironically, in our human desire to appropriate blame in case of failure, we may have to block learning. Quoting the article: "This has implications for system design: it may, for instance, rule out the use of artificial neural networks, decision-making systems that learn from example rather than obeying predefined rules."

  • Searching for the phrase "machine learning" in PLoS Computational Biology returns 250 hits, showing how machine learning has revolutionised biological research in the high throughput age.

  • In high energy physics, particle accelerators use anomaly detection algorithms to only save data which may be interesting. The ultimate learning with data streams application.

At NIPS 2008, at the last talk of the Machine Learning in Computational Biology mini-symposium, I had the pleasure of being inspired by Thomas Lengauer's activities proposing anti-HIV therapy. I'd say that this "solves" challenge number 5 in Kiri's list. Remarkably (unfortunately?), their recommendation site remains just that, a recommendation site, and has yet to navigate the legislative nightmare of getting a website to prescribe drugs. In answer to a question, he said that Germany was one of the few places in the world where legislation even allows doctors to use such drug recommendation sites. A scan of the titles cited by the review article reveals keywords which would fit comfortably in a machine learning venue:

- multiple linear regression
- simple linear model
- prediction-based classification
- artificial neural networks
- self organising feature maps
- non-parametric methods
- sparse models
- convex optimization

But doom and gloom persists. Why? My personal opinion is that like most successful technologies, machine learning fades into the background once it has impact. In that vein of thought, we can measure the impact of machine learning by the decline of ICML, JMLR and friends. Meanwhile, I'm going to go back to making machine learning disappear...

Please join in the discussion at

Google Summer of Code 2012

Posted by Cheng Soon Ong on April 24, 2012

The list of Google Summer of Code (GSoC) students for 2012 has been announced.

For young programmers, it is probably the easiest way to get your foot in the door by showing that you can contribute to something worthwhile. For open source projects, it is an injection of fresh blood. For academics looking for programmer types, it is a good way to differentiate between all the applicants with top marks from universities you do not personally know.

Among the mentoring organisations which may be of interest to the machine learning community:

A warm welcome to everyone!

Open Access is very cheap

Posted by Cheng Soon Ong on March 8, 2012

Stuart Shieber just posted very convincing evidence that publication does not really cost that much, at least in technically savvy fields. Read about it here...

An efficient journal

"Adding it all up, a reasonable imputed estimate for JMLR’s total direct costs other than the volunteered labor (that is, tax accountant, web hosting, domain names, clerical work, etc.) is less than $10,000, covering the almost 1,000 articles the journal has published since its founding — about $10 per article. With regard to whose understanding of JMLR’s financing is better than whose, Yann LeCun I think comes out on top. How do I know all this about JMLR? Because (full disclosure alert) I am [ed. the publisher]."

This shows that Yann LeCun knew what he was talking about in the argument with Kent Anderson in the comments section of the Scholarly Kitchen. mloss.org does not cost that much either. It was initially hosted by the Friedrich Miescher Laboratory in Tuebingen, and is now hosted by the Technical University Berlin. All coding was done by Soeren, Mikio and me in our free time. mldata.org cost a bit more because we paid a programmer and an intern for a few months at the start, and we also bought a more serious server for the heavier load. Luckily we have a PASCAL2 grant.

The real cost, as with JMLR, is the volunteer time needed. In fact, the mloss/mldata team is stretched pretty thin at the moment, and any help would be most welcome. Please contact us if you have a few free hours!

Did the MathWorks Infringe the Competition Laws?

Posted by Soeren Sonnenburg on March 2, 2012

I have just read that the EU commission is investigating whether The MathWorks infringed EU competition law, potentially in relation to its software Matlab and Simulink. An unnamed competitor appealed to the EU commission, claiming that The MathWorks refused to provide it a license for Matlab/Simulink. This hinders making the competing product interoperable, and makes it impossible for the competitor to perform (rightful!) reverse engineering.

The original source is here

Nature Editorial about Open Science

Posted by Cheng Soon Ong on February 28, 2012

The case for open computer programs

Does open source software imply reproducible research?

There was a recent Nature editorial expounding the need for open source software in scientific endeavors. It argues that many modern scientific results depend on complex computations and hence source code is needed for scientific reproducibility. It is nice that a high profile journal has published articles promoting open source software, since it increases visibility. However, some more careful thought is required, as the message of the article is inaccurate in both directions.

Open source provides more benefits than just reproducibility

Actually, open source provides more than is necessary for reproducibility, since the licenses provide the ability to edit and extend the code, as well as preventing discriminatory practices. To be pedantic, for reproducibility, any software (even a compiled executable) would do.

We've said this before but the message is worth repeating. Open source provides:

  1. reproducibility of scientific results and fair comparison of algorithms;
  2. uncovering problems;
  3. building on existing resources (rather than re-implementing them);
  4. access to scientific tools without cease;
  5. combination of advances;
  6. faster adoption of methods in different disciplines and in industry; and
  7. collaborative emergence of standards.

See our paper for more details.

Having source code does not imply reproducibility

As the editorial observes in the final sentence of the abstract, "The vagaries of hardware, software and natural language will always ensure that exact reproducibility remains uncertain, ...". I've personally spent many frustrating hours trying to get somebody's research code to compile. In fact, one of the most common complaints by reviewers of JMLR's open source track is that they are unable to get submissions to work on their computer. The multitude of computing environments, numerical libraries and programming languages means that, very often, the user of the software is in a different frame of mind from the authors of the source code. My advice to fledgling authors of machine learning open source software is to provide a "quickstart" tutorial in the README, because everybody is impatient, and nobody will look into fixing your bugs before they are convinced that your code will do something useful for them. And yes, fixing $PATH can be tricky if you don't know exactly how to do it.

I guess the bottom line is quite an obvious statement: Good open source software will give you reproducibility and a few other additional benefits.

Tagging Project 'Published in JMLR'

Posted by Soeren Sonnenburg on February 3, 2012

I did some minor maintenance work today updating email addresses of certain projects and merging libmlpack and MLPACK.

More importantly, I did add the jmlr mloss url to all listed projects that have a corresponding jmlr publication. Since there seemed to be quite some backlog, it might be worthwhile to shed some light on how one gets flagged 'published in jmlr'.

Naturally, the precondition is a corresponding accepted and already-online jmlr mloss paper. In addition, you should remind your jmlr action editor to flag the project 'published in jmlr' by giving him the link to the abstract of your publication on the jmlr homepage (the field 'abs'). He will then add this cross-reference, and your project benefits from the increased visibility.

I hope that makes the process more transparent and reduces the backlog - ohh and btw we are counting 29 JMLR-MLOSS submissions - keep it going :-)


Posted by Soeren Sonnenburg on December 16, 2011

Meeting Cheng at NIPS, we had a discussion on how to improve the user experience of mloss.org. So I got my hands dirty and fixed a few minor issues on the site.

A long-standing feature request was that software that once appeared in JMLR should continue to be tagged published in JMLR and be highlighted. This is now the case.

In addition, I again limited the automagic pulling of R-CRAN packages, to now happen only once per month. This should give manually updated software higher visibility again.

If you have suggestions for improvements and don't mind coding a little Python, mloss.org's source code is now available on github. In particular, if you'd like to attempt to improve the R-CRAN slurper, its code is here.

Mendeley/PLoS API Binary Battle (winners)

Posted by Cheng Soon Ong on December 6, 2011

The results of the Mendeley/PLoS API Binary Battle are out:



Share your personal genome from 23andMe or deCODEme to find the latest relevant research and let scientists discover new genetic associations.

1st runner up


Continual reviews of papers, even after they are published.

2nd runner up


Something close to my heart, programmatic interface to data!

What is a file?

Posted by Cheng Soon Ong on December 5, 2011

What is a file?

Two bits of news appeared recently:

  • The distribution of file sizes on the internet indicates that the human brain limits the amount of produced data. The article however observes that "it'll be interesting to see how machine intelligence might change this equation. It may be that machines can be designed to distort our relationship with information. If so, then a careful measure of file size distribution could reveal the first signs that intelligent machines are among us!"

  • Paul Allen's Institute has been publishing its data in an open fashion. Ironically, the article is behind a paywall. However, the Allen Institute for Brain Science has a data portal.

I wondered about the distribution of data which is clearly machine generated and, in some sense, most easily digested by machines as well. It turns out that it is quite difficult to find out how big files are. In some sense, for the brain atlas, the amount of data (>1 petabyte of image data alone) is more than is easily transferable across the internet. Most human users of this data would use some sort of web-based visualization, and hence the meaning of the word "file" isn't so obvious. In fact, there has been a recent trend to "hide" the concept of a file. One example is iPhones and iPads, where you do not have access to the file system, and hence do not really know whether you are transferring parts of a file or streaming bytes. Another example is Google's AppEngine, where users access data through a database. A third example is Amazon's Silk browser, which "renders" a web page in a more efficient fashion using Amazon's infrastructure rather than your local client.

If we take the extreme view that we use some sort of machine learning algorithm to filter the world's data for our consumption, this implies that all the world's data is in one "file", and we are just looking at parts of it. From this point of view, the paper about using file sizes to reveal machine intelligence is not going to work. In fact, thinking about file sizes in the first place is just plain misleading.

Linus's Lessons on Software

Posted by Cheng Soon Ong on September 27, 2011

Linus Torvalds talks about how to run a successful software project.

Two things people commonly get wrong:

“The first thing is thinking that you can throw things out there and ask people to help,”

“The other thing—and it's kind of related—that people seem to get wrong is to think that the code they write is what matters,”

The main points on how to run a successful project:

  • It is not about the code, it is about the user
  • A good workflow for the project is important, and tools may help to create a good workflow.
  • For big projects, development happens in small core groups
  • Let go, and don't try to control the people and the code

Have a look at the full article.

Software Freedom Day

Posted by Cheng Soon Ong on September 17, 2011

Happy software freedom day!

Mendeley/PLoS API Binary Battle

Posted by Cheng Soon Ong on August 29, 2011

It seems that there are many challenges being organised recently, but a recent announcement by Mendeley and PLoS caught my interest because they have different motivations. They have teamed up to create the Mendeley/PLoS API Binary Battle.

First, both of them are dealing with scientific publications, things which we (academics) all care about. So, any winner of a challenge which improves how we deal with the bread and butter of academic life is of interest. Many of us already use Mendeley to manage our reading, and may also have published in PLoS, and so unlike previous competitions, the actual application may impact machine learners as a whole.

Second, the challenge is not about predicting better, but about designing the most creative use of the API from either PLoS or Mendeley. This is an opportunity for fledgling machine learners to define interesting learning tasks. To stimulate your creativity, here are some ideas that have already been suggested.

In my experience with the practical side of machine learning, the problem with solving real applications is not the lack of access to the data, but more that the data is in the wrong format, and is only available on CDs or something like that. So, a data API goes a long way towards defining how to interact with the objects, and how to define meaningful machine learning tasks. I think data APIs are something worth supporting.

As usual, there is prize money.

Bias corrected

Posted by Cheng Soon Ong on July 28, 2011

In 1839, Samuel George Morton published what was to be a series of works on human skulls. In an amazing series of highly detailed experiments, he objectively studied the difference in cranial capacities between human populations. In this pre-Darwinian era, his systematic approach of measuring large numbers of specimens was revolutionary. In the end he had measurements of up to 1000 skulls.

With the new century, the question he was asking (whether humans had a single origin or multiple origins) faded into obscurity, along with his work. That changed in 1978, when Stephen Jay Gould published a paper in Science claiming that scientists are inherently biased. His prime example was Morton's experiments, arguing that Morton's results (Caucasians had the biggest brains, Indians were in the middle, and Negroes had the smallest) were flawed. This Science paper would probably also have lapsed into obscurity, except that Gould was a wonderful communicator. His 1981 book "The Mismeasure of Man" was a bestseller, and people took notice of the fact that scientists were, after all, human.

Did Morton fudge his results?

Forward to June 2011. Because Morton made all his results publicly available, and had maintained exquisite details of his experiments (the equivalent of open access and open source today), paleoanthropologist Jason E. Lewis, with a team of other scientists, had another look at the original work. It turns out that Gould didn't look at the skulls, but just read the papers. Lewis went down to U. Penn and took measurements again from 308 of the 670 original skulls. The conclusion?

Morton did not bias his results.

This amazing saga shows how important it is to question the "truth", and furthermore how important it is to keep records and materials such that the "truth" can be reinvestigated. To quote the PLoS Biology paper: "Our results resolve this historical controversy, demonstrating that Morton did not manipulate data to support his preconceptions, contra Gould. In fact, the Morton case provides an example of how the scientific method can shield results from cultural biases."


Fear of salads

Posted by Cheng Soon Ong on June 30, 2011

At ICML yesterday, I saw two interesting papers about crowdsourcing. "Adaptively Learning the Crowd Kernel" looks at learning similarities between n objects by choosing triplets (a, b, c) and asking human experts to say whether a is nearer to b or c. The second paper, "Active Learning from Crowds", proposes a probabilistic model for actively choosing both examples and expert annotators. Unfortunately, neither paper seems to have its software available online.
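To make the triplet-query setup of the first paper concrete, here is a minimal sketch in Python. Everything here (the objects, the hidden positions, the `expert_answer` function) is my own illustration of the query model, not the authors' code: a simulated "expert" answers whether a is nearer to b or c using coordinates the learner never sees.

```python
import random

# Hidden ground-truth positions that a human expert implicitly knows;
# the learner only ever observes answers to triplet queries.
positions = {"cat": 0.0, "lion": 1.0, "car": 10.0}

def expert_answer(a, b, c):
    """Return True if a is nearer to b than to c (a simulated annotator)."""
    return abs(positions[a] - positions[b]) < abs(positions[a] - positions[c])

# Collect a few random triplet judgements: the raw material for
# learning a similarity ("crowd kernel") over the objects.
objects = list(positions)
queries = []
for _ in range(5):
    a, b, c = random.sample(objects, 3)
    queries.append((a, b, c, expert_answer(a, b, c)))
```

The interesting part of the paper is of course choosing the triplets adaptively rather than at random, which this sketch does not attempt.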

Some weeks ago, there was an outbreak of a particularly menacing strain (O104) of E.coli in Europe. Now, E.coli is one of the most widely studied organisms in biological labs worldwide, and its genome was one of the first to be published, way back in the last century (1997). This relatively small genome (4.6 million base pairs in length, containing 4288 annotated protein-coding genes) means that it can be sequenced quite quickly. In fact, nine isolates have now been sequenced by five different teams on four different sequencing platforms, including the Ion Torrent, Illumina HiSeq, Roche's 454 GS Junior, and most recently the Illumina MiSeq. From the sequencing perspective, this is really the first time the different next-generation sequencing platforms can be compared. There will definitely be some improvements in bioinformatics pipelines once researchers understand the read errors on the different platforms better by comparing them.

All this data has been collected on github, giving an excellent crowd sourced dataset for machine learners. This rich dataset could be used to study evolution, and also to understand the mutations that caused virulence. This provides a great opportunity for the machine learning community to break out of the binary classification mold, and study some interesting new machine learning tasks.

Swamped in R-CRAN updates

Posted by Soeren Sonnenburg on May 24, 2011

It seems like the regular updates of packages in R-CRAN are starting to hide the manually updated packages on mloss.org. We are therefore only updating R-CRAN packages once per week (instead of daily, as we used to).

I hope this gives your packages increased visibility again.

reclab prize

Posted by Cheng Soon Ong on May 17, 2011

After the success of the Netflix prize, it seems that another company would also like to entice smart machine learners to solve its recommendation problem. The idea is the same: improve 10% over the baseline to win 1 million dollars.

Details are available at:

A couple of things are different though:

  • There is a 250,000 bonus for your academic institution.
  • The leaders of the Netflix prize were all using ensemble-type classifiers (see the literature below, and a previous post), and it seems like the reclab prize wants some diversity, by actually having "peer review" to choose the semi-finalists.
  • Instead of having a fixed training and test set, the best algorithms would be run against live traffic.
  • Since software is much smaller than the data, it makes much more sense to move source code to data than vice versa. And competitors must submit source only!
  • You can (kind of) use third-party code, as long as it is on Maven. A strange restriction on the type of license, really. It may make sense to disallow GPL "contamination", but why exclude all the other open source licenses?

You can bias the competition to your favour by nominating your friends as reviewers. ;-)

The Netflix winners

  • Y. Koren, "The BellKor Solution to the Netflix Grand Prize" PDF (2009).
  • A. Töscher, M. Jahrer, R. Bell, "The BigChaos Solution to the Netflix Grand Prize" PDF (2009).
  • M. Piotte, M. Chabbert, "The Pragmatic Theory solution to the Netflix Grand Prize" PDF (2009).

Open Data Challenge

Posted by Cheng Soon Ong on May 13, 2011

20,000 euros to be won at:

Just so that this doesn't sound too much like a scam, this is a competition that is closing soon, and it is being organised by the Open Knowledge Foundation and the Open Forum Academy. There are four categories:

  • Ideas
  • Apps
  • Visualizations
  • Datasets

A problem with reproducible research

Posted by Cheng Soon Ong on May 3, 2011

One weird side effect of open source software and reproducible research is that it would make it much more challenging to set meaningful computational exercises for teaching.

I'm organising a course this semester that looks at various applications of matrix factorization. The students solve various matrix problems throughout the semester, and apply them to questions such as compression, collaborative filtering, role-based access control and inpainting. The various solutions to the applications are ranked, and students are graded based on their rank in class for this part of the course. At the end of the semester, there is a small project where the students have to do something novel and write a short paper about it. We thought about trying to encourage open source submissions to the exercises and projects, but quickly realized that it would raise the bar.

If all students submitted open solutions to their exercises, then it would quickly become a plagiarism-checking nightmare for the teaching assistants, since students submitting later would be able to copy earlier solutions. However, requiring each exercise submission to be different from previous ones is also somewhat unfair, as it quickly becomes quite difficult to find new ways to solve an exercise. Just to put things in perspective, exercises are simple things like using singular value decomposition to perform image compression. However, making solutions public has all the benefits that we know and love from open source software. More importantly, in a classroom environment, we encourage the students to learn from each other's solutions and to discuss problems amongst themselves.
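The SVD image-compression exercise mentioned above fits in a few lines. This is a minimal NumPy sketch of the idea (the actual course exercises were presumably more elaborate): keep only the top-k singular values of an image matrix and measure how the reconstruction error shrinks as k grows.

```python
import numpy as np

def svd_compress(image, k):
    """Rank-k approximation of a 2-D array via truncated SVD."""
    U, s, Vt = np.linalg.svd(image, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# A random stand-in for an image: a higher rank k gives a smaller
# reconstruction error, at the cost of storing more numbers.
rng = np.random.default_rng(0)
img = rng.random((32, 32))
err5 = np.linalg.norm(img - svd_compress(img, 5))
err20 = np.linalg.norm(img - svd_compress(img, 20))
```

The "compression" is that a rank-k approximation of an m-by-n image needs only k(m + n + 1) numbers instead of mn.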

Fine, we thought: "we can make the solutions open after the exercise deadline". This somewhat defeats the last idea, of encouraging students to discuss and solve problems together. Since the lectures cover different material by then, the students are less motivated to work on a previous exercise. More subtly, it would make the final project much more challenging. If everything were secret, then all the students had to do for the final project was whip together some "baseline" methods using their exercise submissions, and develop a "novel" method that beats their baseline. Given the short 6-week time frame for the project, we do not expect significant novelty, but something that was not presented in the lecture. However, if all student exercise solutions were open, the novelty bar would quickly rise, as the students would now have a baseline of all submitted exercise solutions.

Even if we could figure out a way to time it such that solutions could not be copied by other submissions, there is still an effect on the following year's course. Since the previous year's solutions would all be available, the new batch of students would need to be "different" from all previous iterations of the course. Of course, some "leaks" happen already, since students get solutions from their seniors, and there are already plenty of publicly available open source solutions out there.

In essence, what we need are courses that are unique each year (in each university), and still have "easy" enough exercises.

I'm ashamed to admit that in the end, in the face of these challenges, we decided that we would keep all submissions secret, and did not push an open source idea for this course.

crisis response

Posted by Cheng Soon Ong on April 6, 2011

Can we deploy machine learning software to help out in a crisis? Are our software tools flexible enough such that we can quickly put together a prediction system within a few hours or days? I'd like to briefly examine what types of prediction tasks could be useful.

The most obvious questions relate to mapping:

  • Given sensor readings for several locations, how do we generalize to the others? This is a classical regression task, and there is a lot of work on this in geospatial analysis. There is crowd-sourced data available on radiation levels in Japan, and there are sites that try to make it easier for people to submit readings. As far as I know, there has been no interpolation of the results, probably due to the fear of making wrong predictions.

  • Where else do we need readings? This active learning type question has gained popularity in recent years in the machine learning community.

  • Where is help most needed? Image overlays were used after the hurricane in New Orleans and the floods in Pakistan to create before-and-after photos, which were then used for manually identifying priorities, planning logistics and working out access routes for relief operations. There are all sorts of machine learning questions here, such as ranking, path planning, etc.

  • Which is the closest team? Related to the previous point, when there is already a call for help, how do we decide to allocate our resources? Just allocating the nearest neighbour may not be optimal, as there may be other resources further away that are free. In addition, travel time has to be taken into account. These are the types of questions that the sensor networks community has been investigating.
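The interpolation task in the first bullet can be sketched in a few lines. Below is a hypothetical minimal example using inverse-distance weighting, a crude stand-in for proper geostatistical methods such as kriging; the function name and the sensor data are entirely made up:

```python
import math

def idw_interpolate(coords, values, query, power=2.0, eps=1e-12):
    """Inverse-distance-weighted interpolation of sensor readings."""
    predictions = []
    for qx, qy in query:
        # Closer sensors get larger weights; eps avoids division by zero
        weights = [1.0 / (math.hypot(qx - x, qy - y) ** power + eps)
                   for x, y in coords]
        total = sum(weights)
        predictions.append(sum(w * v for w, v in zip(weights, values)) / total)
    return predictions

# Four sensors at the corners of a unit square (made-up readings)
coords = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
values = [1.0, 1.0, 3.0, 3.0]

# The centre is equidistant from all four sensors, so the prediction
# is the plain average of the readings
pred = idw_interpolate(coords, values, [(0.5, 0.5)])
```

A real deployment would of course also want uncertainty estimates, which is exactly why people hesitate to publish interpolated maps.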

Assuming we know what problem we need to solve, it may still be a long way to a working implementation. One bottleneck commonly faced by applications of machine learning is data, but for mapping problems, the two popular sources (Google Maps and OpenStreetMap) have good APIs. These allow users to get data, and also to include predictions. And there seems to be lots of open source software solving the formal tasks. How much work is the remaining "glue"?

It would be wonderful if machine learning could make an impact in times of crisis.

This post was motivated by a very nice survey by Peter Suber on how open access can change things during a humanitarian crisis: "Beyond those survival basics, several forms of humanitarian assistance take the form of free online access to research". The sources mentioned above come from this newsletter.

Site Update

Posted by Soeren Sonnenburg on March 23, 2011

Dear all,

you might recently have noticed occasional hiccups of the server. Since last Friday we have dropped our unstable Apache setup and switched to a load-balanced solution utilizing pound and fapws3. It really is much faster now: hammering tests showed that I can continuously stream the frontpage at full network speed (more than 10 MB/s).

I took the opportunity to also fix a few minor bugs on the site - but as always the source code is available.

libraries vs scripts

Posted by Cheng Soon Ong on February 18, 2011

Structuring software projects is one of the major challenges in computer science. For machine learning research, software should be easy to use yet flexible. One distinction I've found practically useful is that between library and script. Basically, computations are hidden away behind interfaces, which separate the library from the script.

Any software is essentially a sequence of commands, executed in order to produce the desired machine learning result. However, human beings are particularly bad at dealing with large unstructured sequences (think spaghetti code), so it is often useful to abstract away the details behind an interface. I am not going to get into the debate about the "right way" to perform abstraction; I will just use an object-oriented classifier as an example. This gives us the following toy example:

Library (abstract base class: Classifier)

  • kNN
  • NN
  • SVM
  • RF

Interface (i.e. each class should implement the following)

  • train
  • predict

Script (i.e. code that uses the library)

  • compute k-fold cross validation
  • collect and summarise results
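As a toy sketch of this layout in Python (all names are illustrative, not from any particular library), the library side defines the interface and the script side runs the experiment:

```python
from abc import ABC, abstractmethod

class Classifier(ABC):
    """Library side: the interface every classifier must implement."""
    @abstractmethod
    def train(self, X, y): ...
    @abstractmethod
    def predict(self, X): ...

class MajorityClass(Classifier):
    """A trivial stand-in for kNN / NN / SVM / RF: always predicts
    the most frequent training label."""
    def train(self, X, y):
        self.label = max(set(y), key=y.count)
    def predict(self, X):
        return [self.label] * len(X)

# Script side: run an experiment and summarise the result
X = [[0], [1], [2], [3]]
y = [0, 0, 1, 0]
clf = MajorityClass()
clf.train(X, y)
accuracy = sum(p == t for p, t in zip(clf.predict(X), y)) / len(y)
```

The script part is exactly the code that produces the numbers for a paper; everything above it is the reusable library.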

One question that already is apparent from this simple example is whether cross validation should be part of the library or the script. I consider the library the reusable part of code, and the script as the customization part. In essence, the script is the code that runs my library and produces the results that I can cut and paste into papers. This involves code for plotting, generating LaTeX tables, etc. So, as my code evolves, and it turns out that I use something across different papers, it migrates from the script side to the library side. So, my working definition of what goes into the script and what into the library is by looking at whether it is reused.

One advantage of structuring my code this way is that the scripts serve as "use cases" for the library. They provide examples of what the library interface means and how it should be used. This natural side effect of reproducible computational results also provides a (weak) test case for future changes to the library.

Interestingly, even though I use cross validation all the time to tune hyperparameters, it has resisted all my attempts to make it part of the library. I have many different versions of cross validation all over my code base. Quite irritating really, but I haven't been able to find an abstraction that works for all the different types of parameters that I tune, such as features to choose, normalization, regularization (of course), etc. Anybody have a good suggestion?
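For what it's worth, here is one possible abstraction, sketched under the assumption that every tunable thing (features, normalization, regularization) can be folded into a single "setting" object handed to a model factory. `cross_validate` and `Threshold` are made-up names, not code from my code base:

```python
def cross_validate(build_model, settings, folds):
    """Score each candidate setting by average accuracy over folds.

    build_model: callable mapping a setting to an object with
                 train(X, y) and predict(X) methods
    settings:    candidate hyperparameter settings (any hashable objects)
    folds:       list of ((X_train, y_train), (X_test, y_test)) pairs
    """
    scores = {}
    for s in settings:
        accs = []
        for (X_tr, y_tr), (X_te, y_te) in folds:
            model = build_model(s)   # the whole pipeline is rebuilt per fold
            model.train(X_tr, y_tr)
            preds = model.predict(X_te)
            accs.append(sum(p == t for p, t in zip(preds, y_te)) / len(y_te))
        scores[s] = sum(accs) / len(accs)
    return max(scores, key=scores.get), scores

class Threshold:
    """Toy 'pipeline': predicts 1 whenever the first feature >= t."""
    def __init__(self, t):
        self.t = t
    def train(self, X, y):
        pass                         # nothing to fit in this toy model
    def predict(self, X):
        return [int(x[0] >= self.t) for x in X]

# One fold with an empty training set, since Threshold learns nothing
folds = [(([], []), ([[0], [1], [2], [3]], [0, 0, 1, 1]))]
best, scores = cross_validate(Threshold, [0.5, 1.5, 2.5], folds)
```

The trick is that cross validation only ever sees opaque settings and a factory, so feature selection or normalization choices can hide inside `build_model`. Whether this survives contact with real projects is exactly the open question.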

OSSC 2011: Call for papers

Posted by Cheng Soon Ong on January 25, 2011

The 2011 IEEE International Workshop on Open-source Software for Scientific Computation will be held on Oct 12-14, 2011, in Beijing, China.

Important dates:

  • Paper submission: July 24, 2011
  • Acceptance notification: August 24, 2011
  • Camera ready: September 14, 2011

For more details:

Scheduled Downtime December 17-19

Posted by Soeren Sonnenburg on December 14, 2010

The site will have a scheduled downtime on December 17-19 since the TU-Berlin is moving its data center. The new center will have significantly improved cooling/electricity facilities and network bandwidth.

Apologies for the inconvenience.

Call for Presentations for FOSDEM data devroom

Posted by Mikio L. Braun on December 14, 2010

Next year's FOSDEM meeting, which focuses on free and open source software, has a special devroom for data analysis and machine learning projects. The call for presentations ends on December 17, 2010 (this Friday). The meeting will be held on February 5, 2011 in Brussels, Belgium.

New Journal: Open Research Computation

Posted by Mikio L. Braun on December 13, 2010

A new journal with a focus on software used in research has opened: Open Research Computation. Similar to the MLOSS Track at JMLR, the journal focuses on software submissions and sets high standards for code quality and reusability.

The journal is also discussed in this blog post.

Posted by Soeren Sonnenburg on December 8, 2010

We have released a new community portal to collaboratively upload and define datasets, tasks, methods and challenges.

It is meant as the next step after UCI, enabling reproducible research. It complements UCI: in contrast to UCI, data sets can be uploaded and edited wiki-style in a collaborative fashion. We support download and upload in various data formats like .matlab, .octave, .csv, .arff and .xml for your convenience. Naturally, the website supports web 2.0 features like tagging, comments, email notifications, searching, browsing and a forum.

Going beyond a mere collection of datasets, one can define tasks to be solved on a particular dataset, including the train/test split, the input and output variables, and the performance measure.

One can then upload one's method's predictions and get server-side evaluations and a ranking of the results based on the performance measure. Note that the site even renders receiver operating characteristic and precision-recall curves.
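Computing such a curve is simple enough to sketch. The following illustrative function (not the code the site actually runs) produces ROC points by sweeping a threshold over prediction scores:

```python
def roc_points(scores, labels):
    """Return (fpr, tpr) points by sweeping a threshold over the scores."""
    pos = sum(labels)
    neg = len(labels) - pos
    # Sort by decreasing score; each prefix corresponds to one threshold
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    points, tp, fp = [(0.0, 0.0)], 0, 0
    for i in order:
        if labels[i]:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

# A perfect ranking walks straight up the y-axis, then right to (1, 1)
pts = roc_points([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
```

Precision-recall curves follow the same sweep, just with different counters per prefix.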

Once you have defined a number of tasks you may group them together defining a challenge.

In contrast to other related portals, all of the public content is immediately available for download (without the need to register with the site). In addition, we supply mldata-utils that enables off-line processing of your data set, i.e. conversion from and to the standard hdf5 based format we defined, an api to download / upload content without accessing the website via a web-browser, and finally to evaluate the performance of your method.

So the site is the ideal platform

  1. for the data set creator who just wants to get researchers to work on their particular data set.
  2. for the machine learning benchmarker who has developed a fancy new algorithm and is in search of a dataset/task that fits their needs.
  3. for the challenge organizer, because the site already provides all of the infrastructure to run challenges.
  4. for the challenge participant, who can conveniently download data and tasks in various formats.

Since all of the site's code is open source (including mldata-utils), we invite machine learning researchers to participate in the development. So if there is a feature missing, let us know and we will try to incorporate it on the site.

Free your code

Posted by Cheng Soon Ong on November 26, 2010

"Not sharing your code basically adds an additional burden to others who may try to review and validate your work", as John Locke was quoted in a recent article in the Communications of the ACM. Of course there is the flip side to this in our competitive academic environment. As Scott A. Hissam puts it "... The academic community earns needed credentialing by producing original publications. Do you give up the software code immediately? Or do you wait until you've had a sufficient number of publications? If so, who determines what a sufficient number is?"

In a data driven computational field like machine learning, many of our results are dependent on some sort of calculation. Yes, in principle, many methods could be implemented from scratch based on a set of equations, but in practice, most people do not have the time (or the capability) to code up all prior art from scratch. In some sense good code (like a good waiter/waitress) remains in the background. My favourite example is all the linear algebra software that is common in many programming environments. Most people don't even think about the numerical complexities of finding eigenvalues since there is a "built in" function for it. This would not have been possible without the BLAS and LAPACK open source projects. So, write code, and make it open source.

"But I don't write good code..."

Nick Barnes from the Climate Code Foundation argues that you should release it anyway. In a recent opinion piece, Nick, along with other well-known people in a Nature News article, gives many reasons why code should be open. In his blog piece, he makes further points. Among them:

  • publication on its own is not enough
  • software skills are important and must be funded
  • open development is important
  • the longest program starts with a single line of code

The long hallway

Posted by Cheng Soon Ong on October 14, 2010

I am currently working on a project with someone I've never met, and two others who are almost a thousand km away. This is work on our sister project. It's one of those things that people talk about over beer, but nobody does anything about: how to make the experimental part of machine learning available and reproducible. Please check out our motivations and come visit our demo at NIPS.

Shameless plug isn't it?

One thing that we tried during this project that was totally new to me was actively collaborating with people who are physically distant. This was called the long hallway by Johnathan Follett, where he referred to the fact that there are many companies today with virtual offices. The experience is distinct from telecommuting, since there aren't people who are "in office". One thing we found really hard was the fact that we are effectively limited to written communication. We tried using VOIP calls but the lag across the Atlantic and poor conference calling was really quite irritating. So, our normal mode of operation is a weekly chat meeting (using Jabber) and a mailing list. One upside of the written chat meeting is that doing minutes is easier afterwards.

As you can imagine, written text is a poor substitute for face to face communication. Many things are really tough, such as trying to define a new concept. Hand waving arguments do not work (since we don't have hands to wave online), and simple misunderstandings persist for a very long time. One example that kept our mailing list busy for weeks was the concept of a training/validation/test split. One of us assumed that there is one dataset, and each training/validation/test dataset is just a subset of this whole dataset. Another assumed that it would be three different datasets. Everything was fine until we thought about how to implement "hidden labels" in challenges. If we have only one dataset, then this requires hiding part of the labels. If we have three datasets, this results in hiding labels for some of the datasets. You can also see in this example that there is a subtle concept of "label" sneaking in already. What is a label? For someone working on simple supervised learning with vectorial data, it is just the relevant column in the matrix. But perhaps there may be multiple possible dependent variables? How about imputing missing values? How do we describe the learning task? What is a solution? Needless to say, such conceptual discussions were very heated, and there have been times when we felt that a good definition is not possible.
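The two readings of "hidden labels" can be made concrete in a few lines of Python (a toy sketch with made-up data, not the actual design we settled on):

```python
# Interpretation 1: one dataset; train/validation/test are index subsets,
# and "hidden labels" means masking the labels at the test indices.
data = {"X": [[0], [1], [2], [3], [4], [5]],
        "y": [0, 0, 1, 1, 0, 1]}
split = {"train": [0, 1, 2], "validation": [3], "test": [4, 5]}

def public_view(data, split):
    """Release everything except the labels at the test indices."""
    hidden = set(split["test"])
    y_public = [None if i in hidden else label
                for i, label in enumerate(data["y"])]
    return {"X": data["X"], "y": y_public}

view = public_view(data, split)

# Interpretation 2: three separate datasets; hiding labels simply means
# never distributing a "y" for the test set in the first place.
test_set = {"X": [data["X"][i] for i in split["test"]]}  # no "y" key at all
```

Written down like this the difference looks trivial, but over email each of us was silently assuming one of the two representations.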

Back to the long hallway. I tend to associate a particular face and voice with written text. Since I have never met one of my collaborators, I seem to have "made up" a particular vision of him, complete with what I think he looks like from the low-res (and probably outdated) photo on his website, and his speech timbre and accent. I was highly disconcerted during our first conference call, several months into the project, when he spoke in a voice that totally didn't fit my mental image. I continue to be surprised by his voice each time we talk, since we don't have phone calls that often. I'm sure I'll be surprised by his physical appearance when I finally see him.

Conference on Open Access and Data

Posted by Soeren Sonnenburg on September 10, 2010

On December 13-14, a conference on open access and open data will take place in Cologne, Germany. The conference website states that, in addition to open access, it will now concentrate on open research data - a subject that we are currently working on too.

Open Data: good or bad?

Posted by Cheng Soon Ong on August 19, 2010

Is sharing always good?

We've been thinking about ways to make it easy for machine learners to exchange data and methods. The assumption behind all this is that sharing is good, and we (as researchers funded by taxpayers' money) should be open with our work.

As has been mentioned before, several funding agencies are pushing for open access to the results of research. One recent story highlights the progress that has been made on Alzheimer's, in part due to data sharing. As fledgling collaborators know, it is really hard to work in large teams of people. This project amazingly brings together the National Institutes of Health, the Food and Drug Administration, the drug and medical-imaging industries, universities and nonprofit groups. One of the key things they had to build was a way for all the different project partners to upload their data; in fact, they had two sites, one for clinical data and a second for the imaging data. And (my machine learning heart cheers) they even detail how they do cross validation. Bottom line? Data sharing has made the project possible.

Would you like your data to be public? Or first private, and only made public after you are "done" with it?

There has been some concern about genome data being uploaded directly to web servers, available to the general public. For example, the Joint Genome Institute puts sequences online, and collaborators get the data at the same time as the general public. So, even if you had the idea to sequence the genome of a particularly interesting organism, someone else might scoop you to the paper if they are faster at analysing the sequences.

I think a middle ground is probably the way to go: in the words of InfoVegan, a GitHub for data.

Moving to TU-Berlin

Posted by Soeren Sonnenburg on August 12, 2010

I have just moved the database and all content from a server running at Max-Planck in Tuebingen to TU Berlin - where two of the developers are currently working. This significantly eases maintainability and re-adds some of the functionality that we previously had but which was disabled due to security concerns. For example, RSS aggregators will work again, as will the CRAN-R integration of their machine learning repository. In addition, the good news is that this server is twice as powerful (more memory, more hard disk space, more CPU power) and has more bandwidth, so we now also have SSL-secured logins. Stay tuned, and please notify me if you notice any glitches that could possibly have been caused by the transition.

We thank Max Planck Tuebingen for a reliable 3.5 years of hosting!

Tournament theory

Posted by Cheng Soon Ong on July 22, 2010

A rubber tapper in Malaysia gets paid based on the amount of latex she can collect each morning. Therefore, if she produces twice as much latex, she earns twice as much money. This is what standard economic theory dictates, because if marginal differences in productivity were not rewarded, it would be a profit opportunity for someone else. However, Swiss hero Roger Federer doesn't earn twice as much money if he hits twice the number of tennis balls. The problem is that it is rather hard to measure Roger Federer's productivity in an absolute sense, and hence we reward him by comparing him with other people and paying the better player more. About 30 years ago, Lazear and Rosen wrote an influential economics paper about tournament theory, which is based on the idea of relative differences in productivity.

Since then, it has been used to describe various economic systems, sport and entertainment being the most common examples. In fact, it can also be used to argue why bosses are overpaid: "The salary of the vice president acts not so much as motivation for the vice president as it does as motivation for the assistant vice presidents." This also results in the long tail of income distribution: a few people at the top make a lot, and many people at the bottom make very little. Recently, articles in the New York Times and the Atlantic about abolishing tenure have attracted lots of comments for and against the idea of tenure in academia. Since it is hard to measure the productivity of a scientist in absolute terms, we reward them by pitting them against each other when hiring new faculty and promoting the current winner.

For software, the story is similar, with a small twist. For example, when you look at the revenue of top App Store apps, you see the characteristic long tail. For machine learning software too, a few packages get most of the users. Admittedly, the benefits of being the best piece of machine learning software are currently a bit dubious. However, the author of a new package can choose to publish software in a new category. This different category (real or imagined) allows a fledgling programmer to have a chance at being top of the heap. Sometimes it is an old research area with a new name, or instead of being the most accurate, the software promises to be the fastest, etc. I'm not sure what tournament theory has to say when there are multiple tournaments. Perhaps someone who knows game theory can comment?

Videos of our 3rd MLOSS Workshop at ICML in Haifa

Posted by Mikio L. Braun on July 21, 2010

Videos for our MLOSS workshop in Haifa are now online.

Software unpatentable in New Zealand

Posted by Cheng Soon Ong on July 16, 2010

In contrast to the bad news we recently had about Germany (no, not about football), there is good news elsewhere. Perhaps it will turn out to be premature, but it seems New Zealand will soon have a law that disallows software patents. This was recently announced by New Zealand's commerce minister Simon Power. There is a caveat, though: inventions containing embedded software can still be patented.

More discussion of the repercussions and also some thoughts about the Bilski case in the US can be found at the New Zealand Computer Society blog.

Why shouldn't software be patentable? Well, because it is abstract, and we do not want to allow 1+1=2 to be patented.

3rd MLOSS workshop at ICML 2010

Posted by Mikio L. Braun on June 29, 2010

On Friday, June 25, 2010, we held our 3rd machine learning open source software workshop at the ICML in Haifa, Israel. All in all, it was a very nice meeting. We again had two very interesting invited speakers, Gary Bradski and Victoria Stodden. This time, we also decided to have only two kinds of presentations: Either a 20 minute talk or a poster presentation with spotlight. In the last meetings, we had longer talks, shorter talks, and poster-only presentations, but we felt that the poster presentations didn't get the attention they deserved. I think the poster spotlights actually worked out quite well.

We opened the workshop with a talk by Gary Bradski, possibly best known in the machine learning community for OpenCV, an open source framework for real time image processing and computer vision. He gave a comprehensive overview of OpenCV and how it is used at Willow Garage, the startup Gary is currently working for to build an open robotics platform called ROS. When asked what he has learned about managing open source projects, he admitted that he considers himself a really bad manager, but he has seen time and again that it really boils down to having a few highly motivated, excellent contributors who can make a real difference.

Victoria Stodden gave a talk on how our current scientific landscape, with its terabytes of raw data and complex data analysis procedures, poses a real challenge to reproducible research. She cited a few recent incidents of big research programs which ran into trouble after the validity of their results or methods was questioned (for example, see "Climategate", or this article about a cancer research program). She strongly advocates that data as well as code must be shared much more openly, and discussed legal implications and how to overcome them. She also presented results from a survey of NIPS participants on whether they have shared code or data and why. Interestingly, most reasons for not sharing were of a personal nature (for example, lack of time to prepare documents, or fear of getting scooped by competitors), whereas reasons for sharing were mostly motivated by communitarian ideals like advancing the state of science more quickly.

We also had many interesting new projects:

  • Several projects addressed Java-centric machine learning: Jstacs and Mulan were broader frameworks, while jblas and UJMP provided fast and flexible matrix libraries for Java.

  • Learning with graphical models was covered by Libra and FastInf.

  • This year we again got pretty comprehensive Python based machine learning libraries, Scikit Learn, PyBrain, as well as Shogun, a large kernel based learning library with bindings to many different languages, including Python.

  • Finally, we also had projects which specialize more, like OpenKernel for kernel learning, a high-quality implementation of AdaBoost, gidoc, a framework for working with handwritten text recognition, and the Dependency Modelling Toolkit, a project that deals with modelling probabilistic dependencies.

We also had a talk given by prerecorded video, a first for this workshop, which was nevertheless quite well received, mostly because the authors put a lot of effort in their video and presented a good mixture of personal presentations, slides with voice-over and screen-cast-style demos.

While we are pretty happy with the quality of our submissions and the gradual adoption of open source software practices in the machine learning community, we again saw little in terms of integration. Common standards have not yet evolved, and there are many similar projects running in parallel. The most common form of interoperability is one library providing a wrapper around another's functionality, mostly SVM learners.

As a first step towards better exchangeability, we also presented our new sister site. It is similar to mloss, but focusses on machine learning data sets. We have now officially launched the website in "beta mode", so be sure to check it out whenever you have some data you want to share with other researchers, and do not hesitate to give us feedback!

The whole workshop was recorded. As soon as the talks are online, we'll let you know.

Till then, here are a few pictures from the workshop.

Software patentable in Germany

Posted by Cheng Soon Ong on May 21, 2010

It seems that the highest German appeals court in matters of civil and criminal law overruled the country's highest patent-specialized court, deciding to uphold a software patent. More analysis is available at the blog foss patents. The original German PDF of the ruling is also available as text.

From foss patents:

  • After a landmark court ruling, the German perspective on the validity of software patents is now closer than ever to that of the US.
  • Basically, Germany has now had its own Bilski case -- with the worst possible outcome for the opponents of software patents.
  • Recently, the Enlarged Board of Appeal of the European Patent Office upheld that approach to software patents as well, effectively accepting that a computer program stored on a medium must be patentable in principle.
  • Defense strategies such as the Defensive Patent License are needed now more than ever.

Defensive Patent License is being proposed by two professors from Berkeley.

Data management plan

Posted by Cheng Soon Ong on May 18, 2010

Starting October, research grant applications to the NSF need to have a data management plan. If you deal with data, and haven't got a plan yet, here is one to follow.

ICML 2010 MLOSS Workshop Preliminary Program now available

Posted by Soeren Sonnenburg on May 13, 2010

The programme of the ICML 2010 Machine Learning Open Source Software workshop is now available. All contributors should have received a notification of acceptance email by now. We thank all of you for your submissions. This year we received 16 submissions of which 5 were selected for talks and 8 for short (5 minute) poster spotlight presentations. These 13 submissions will all be presented in the poster session. A detailed schedule of the workshop is available from the workshop website.

Note that we changed the format of the MLOSS workshop slightly (compared to the previous ones taking place at NIPS): We are now going to have extended poster sessions, with hopefully all authors presenting their work in a (short) talk and posters or even live demos.

We hope this leads to more interaction between projects, and it also allowed us to accept more submissions than the 8 that could have been presented as full talks.

Conflicting OSS goals

Posted by Tom Fawcett on May 12, 2010

It occurred to me while reviewing that the goals of OSS contributors and users are quite varied. Often these goals are in conflict. For example, here are a few ways of classifying packages I've noticed:

  • library with APIs vs complete package (end-to-end). Some packages are libraries with comprehensive APIs and are meant to be used as components in larger systems (or at least they assume the larger system will handle IO, evaluation, sampling, statistics, etc.). Other packages accommodate reading from standard formats (e.g. CSV, ARFF) and handle evaluation and other aspects of experimentation.

  • packages that produce intelligible models (trees, rules, visualizations) vs packages that produce black-box models. Some experimenters want/demand to understand the model, and a black-box "bag of vectors" won't work no matter how good the predictions.

  • flexible, understandable code vs efficient code. Some packages are written to be clean and extensible, while others are written to be efficient and fast. (Of course, some packages are neither :-)

  • single system vs platform for many algorithms. While some researchers contribute single algorithm implementations, there is a clear trend toward large systems (Weka, Orange, scikit.learn, etc.) which are intended to be platforms for families or large collections of algorithms.

In turn, a lot of this depends on whether the user is a researcher who wants to experiment with algorithms or a practitioner who wants to solve a real problem. Packages written for one goal are often useless for another. A program designed for several thousand examples that just outputs final error rates won’t help a practitioner who wants to classify a hundred thousand cases; a package with an interactive interface is very cumbersome for someone who needs to report extensive cross-validation experiments.

It's clear from the JMLR OSS review criteria that JMLR hasn't thought about the wide variety of software issues. So I suggest that the organizers (and contributors) start to think of useful categories for their code that can help people understand and navigate this space.

Open Reviewing

Posted by Cheng Soon Ong on April 29, 2010

What would open reviewing look like?

Recently, there has been a feeling that the peer review system should be revamped. We had a discussion during one of the NIPS lunchtimes about what is the future of NIPS reviewing, with many interesting suggestions. Also, several conferences have recently gone double blind. John recently blogged about compassionate reviewing.

So, following the insightful summary of the various meanings of open, including non-technological ones, I thought: why not open reviewing?

What is being made open?

The reviews and scores of the paper, in an open access fashion. Very much like what Yann has suggested. For true openness, the reviewer's identity should be revealed.

What legal regimes are implicated?

Since reviews today are never revealed, it seems that not even copyright is implicated. But perhaps, since reviews are secret, they are covered under trade secrecy?

How does openness happen?

It can happen at an organisational level: e.g. a workshop, conference or journal can declare that all reviews are open. Or an individual can decide to make his or her reviews public. In fact, there are even two levels of public: you can make yourself known to the authors of the paper, or you can publicly display your review for everyone to see.

We have an opportunity here to do this last idea to publicly investigate a software project. Have a look at the mloss10 submissions which are currently under review for our ICML workshop. Log in and put your reviews in the comments of the respective software projects by 6 May 2010. The program committee (whose reviews unfortunately remain secret) has exactly the same information as you do by looking at the project links.

We would like 3 scores from 1-10, with 10 being best.

  1. Quality: The normal review criteria, like at JMLR
  2. Potential: For very young but interesting projects
  3. Interest: How interesting the software is for the ML community

Deadline Extension and Final Call for Contributions ICML'10 Workshop

Posted by Soeren Sonnenburg on April 12, 2010

To accommodate researchers waiting for decisions on their ICML papers (due April 16) before committing to travel to Haifa, the submission deadline for the Machine Learning Open Source Software (MLOSS) 2010 workshop has been extended to April 23. As a result, we have also pushed back the acceptance notification to May 8. The workshop will take place at ICML 2010, Haifa, Israel, 25th of June, 2010.

Nevertheless, the deadline for the submissions is approaching quickly. We accept all kinds of machine learning (related) software submissions for the workshop. If accepted, you will be given a chance to present your software at the workshop, which is a great opportunity to make your piece of software more known to the machine learning community and to receive valuable feedback.

Detailed submission instructions are available at the workshop website. We are looking forward to your contributions.

How do you structure your code?

Posted by Cheng Soon Ong on March 28, 2010

I am currently doing some refactoring of small bits of research code that I've written, and like many others before me, I've come to the conclusion that some sort of toolbox structure is appropriate for my project. Subscribing to the unix philosophy of writing small bits of code that talk to each other, I tried to see how this would apply to a typical machine learning project.

My interest lies in algorithms and I tend to work with discriminative supervised learning methods, so perhaps my design choices are biased by this. I'd be very happy to hear what other people do with their projects. I believe that there should be three types of toolboxes:

  • Data handling - including file format handling, feature creation and preprocessing, normalization, etc.
  • Learning objectives - which define the mathematical objects that we are searching through, for example hinge loss versus logistic loss, l1 versus l2 regularization. I merge kernels into this part, instead of data handling, because it really is dependent on the type of learning algorithm.
  • Numerical tools - such as convex optimization or stochastic gradient descent.

On top of that, in the interests of reproducible research, for each paper, there should be an "experimental scripts" directory that shows how to go from raw data using the toolboxes (+versions) above to the plots and tables in a particular paper.
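As a sketch, this split might translate into a project layout like the following (all directory and project names here are illustrative, not from an actual project):

```
mltoolbox/
    data/          # file format handling, feature creation, preprocessing, normalization
    objectives/    # hinge/logistic losses, l1/l2 regularizers, kernels
    optim/         # convex optimization, stochastic gradient descent
paper-2010-xyz/
    experiments/   # per-paper scripts: raw data + toolbox versions -> plots and tables
```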

Most projects tend to be monolithic, i.e. they merge all three types of toolboxes into one project. I believe that this is due to our culture of writing a piece of code for a particular paper deadline, effectively giving a bunch of experimental scripts that include all code for data handling, mathematical objects and optimization. Often the argument is that this is the only way to make code efficient, but it also means that code has to be rewritten time and again for basic things such as computing the ROC of a classifier, doing trace normalization of a kernel matrix, or doing "simple gradient descent". For such "easy" things, it may actually be less overhead to just recode things in your own framework, but for potentially more difficult things, such as using CUDA, it would be convenient if the numerical tools library took care of it once and for all.

My current project design (in python) is also monolithic, but I intend to have different packages for data, classifiers and optimization corresponding to the three items above. Experimental scripts for reproducible research are not part of the project, but part of the paper, since I do not want to think about backward compatibility. I mean, should new versions of my code still reproduce old results, or should results be for a particular project version? I'm also using the project structure recommended by this post and this post.

Any tips from more experienced readers are most welcome! Especially on how to keep the code base flexible for future research projects.

Citing Wikipedia

Posted by Soeren Sonnenburg on March 22, 2010

I just stumbled across this blog entry which I found interesting to read.

Quoting the first paragraphs from the source above:

Now it’s well known and generally agreed that you can’t cite Wikipedia for a scientific paper or other serious academic work. This makes sense firstly because Wikipedia changes, both in the short term (including vandalism) and in the long term (due to changes in technology, new archaeological discoveries, current events, etc). But you can link to a particular version of a Wikipedia page, you can just click on the history tab at the top of the screen and then click on the date of the version for which you want a direct permanent link.

The real reason for not linking to Wikipedia articles in academic publications is that you want to reference the original research not a report on it, which really makes sense. Of course the down-side is that you might reference some data that is in the middle of a 100 page report, in which case you might have to mention the page number as well. Also often the summary of the data you desire simply isn’t available anywhere else, someone might for example take some facts from 10 different pages of a government document and summarise them neatly in a single paragraph on Wikipedia. This isn’t a huge obstacle but just takes more time to create your own summary with references.

Nat Torkington on Open Data

Posted by Cheng Soon Ong on March 9, 2010

I recently came across a blog on O'Reilly Radar about Truly Open Data, which talks about how concepts from open source software can be translated to open data. Basically, apart from just "getting the data out there", we need software tools for managing this data. I summarize his list of tools below, with some thoughts on how this may apply to machine learning data.

  • diff and patch - Perhaps we need some md5sum for binary data? It seems that most machine learners actually don't use "live" data very often, so perhaps these resources are not needed for us?
  • version control
  • releases - An obvious release point would be upon submission of a paper. One downside I realized about double blind reviewing is that one cannot release new data (or software) upon submission. Some things are just easier to do with some real bits.
  • documentation - Apart from bioinformatics data that I generated myself, I'd be hard pressed to name one dataset (apart from iris) where I know the provenance of the data.
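On the "md5sum for binary data" point above, a minimal sketch of what I mean: fingerprint a dataset file so a paper can state exactly which version of the data was used (the filename and chunk size are illustrative):

```python
import hashlib

def dataset_md5(path, chunk_size=1 << 20):
    """Compute the MD5 checksum of a data file, reading in chunks so that
    large datasets do not need to fit into memory."""
    digest = hashlib.md5()
    with open(path, 'rb') as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()
```

Publishing such a checksum alongside a release would at least let others verify they are running experiments on the same bits.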

Daniel Lemire on Open Source Software

Posted by Mikio L. Braun on February 16, 2010

Daniel Lemire has an interesting blog post on whether open sourcing your software affects your competitiveness as a researcher.

In short, here is his summary:

  1. Sharing can’t hurt the small fish.
  2. Sharing your code makes you more convincing.
  3. Source code helps spread your ideas faster.
  4. Sharing raises your profile in industry.
  5. You write better software if you share it.

This is very much in line with why we started the whole initiative in the first place.

MLOSS 2010 - ICML Workshop just accepted

Posted by Soeren Sonnenburg on February 12, 2010

We are glad to announce that our MLOSS 2010 workshop at this year's ICML conference has been accepted!

We are therefore happily accepting software submissions. The deadline for the submissions is April 10th, 2010. If accepted, you can present your software to the workshop audience, which is a great opportunity to make your piece of software more known to the machine learning community.

Like last time, we will use mloss.org for managing the submissions. You basically just have to register your project with mloss.org and add the tag icml2010 to it. For more information, have a look at the workshop page.

Missing values

Posted by Cheng Soon Ong on February 2, 2010

We were recently working on a way for efficiently representing data, and came across the problem of missing values. For simple tabular formats with the same type (e.g. all real values), it is convenient to store data as a 2-D array. We are thinking of a Python numpy array, but I'm sure any solution should be language independent. However, very often, datasets contain missing values, which are indicated by some special character, for example by '?' in weka's arff format. Unfortunately, the character '?' is not a real number, hence stuffing up the array.

Does anyone have a suggestion on how to deal with this?

Note that I'm not talking about something like missing value imputation, but just the question of how to represent simple tabular data in computer memory. Of course, the question can be generalized such that some features may have different types from others.

This seems like such a common problem that there must be hundreds of solutions out there...
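For what it's worth, here is a small sketch of two common numpy-based representations (the toy values are made up): NaN as the missing-value sentinel, and masked arrays, which keep the mask separate from the values:

```python
import numpy as np

# Toy table with missing entries (None plays the role of weka's '?').
raw = [[5.1, 3.5], [4.9, None], [None, 3.2]]

# Option 1: use NaN as the sentinel. Simple, but forces every column to be
# float, and naive operations silently propagate NaN.
a = np.array([[np.nan if v is None else v for v in row] for row in raw])
col_means = np.nansum(a, axis=0) / np.sum(~np.isnan(a), axis=0)

# Option 2: a masked array stores the mask separately from the data, so it
# also works for integer or categorical columns.
m = np.ma.masked_invalid(a)
masked_means = m.mean(axis=0)  # ignores masked entries automatically
```

Neither handles heterogeneous column types by itself; for that, a structured array or one array per column seems to be the usual workaround.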

Data and Code Sharing Roundtable

Posted by Victoria Stodden, Chris Wiggins on January 26, 2010

As pointed out by the authors of the mloss position paper [1] in 2007, "reproducibility of experimental results is a cornerstone of science." Just as in machine learning, researchers in many computational fields (or in which computation has only recently played a major role) are struggling to reconcile our expectation of reproducibility in science with the reality of ever-growing computational complexity and opacity. [2-12]

In an effort to address these questions from researchers not only from statistical science but from a variety of disciplines, and to discuss possible solutions with representatives from publishing, funding, and legal scholars expert in appropriate licensing for open access, Yale Information Society Project Fellow Victoria Stodden convened a roundtable on the topic on November 21, 2009. Attendees included statistical scientists such as Robert Gentleman (co-developer of R) and David Donoho, among others.

The inspiration for this roundtable was the leadership of the genome research community in establishing the open release of sequence data. Representatives from that community gathered in Bermuda in 1996 to develop a cooperative strategy both for genome decoding and for managing and sharing the resulting data. Their meeting resulted in the "Bermuda Principles" [13] that shaped the ensuing data sharing practices among researchers and ensured rapid data release. In the computational research community more generally the incentives and pressures can differ from those in human genome sequencing; consequently, the roundtable sought to consider the issues in a larger context. A second goal of the workshop was to produce a publishable document discussing reactions to data and code sharing in computational science. We also published short topical thought pieces [14] authored by participants, including by statistical scientists [15-16], raising awareness of the issue of reproducibility in computational science.

The Data and Code Sharing Roundtable adapted the focus of the genomics community to include access to source code as well as data, across the computational sciences. This echoes mloss's call for "the supporting software and data" to be openly distributed through the mloss repository with links to alternatively stored data collections. The Yale roundtable was organized in five parts: framing issues, examining legal barriers and solutions, considering the role of scientific norms and incentives, discussing how technological tools help and hinder sharing, and finally crafting key points for release in a statement. The agenda is available online [17] with links to each session's slide decks.

The first session framed issues across the disparate fields and was moderated by Harvard Astronomy Professor Alyssa Goodman, and featured presentations from Mark Gerstein, the Albert L. Williams Professor of Computational Biology and Bioinformatics at Yale, Randy LeVeque, the Founders Term Professor of Applied Mathematics at the University of Washington, and Alyssa Goodman herself. The second session was moderated by Frank Pasquale, the Loftus Professor of Law at Seton Hall University, and discussed legal barriers to the sharing of research codes and data and presented alternate licensing frameworks to enable sharing. Pat Brown, Professor of Biochemistry at Stanford University, moderated the session on norms and incentives, leading a discussion of publishing models, peer review, and reward structures in the scientific community. The session on computational solutions was moderated by Ian Mitchell, Computer Science Professor at the University of British Columbia, and examined computational solutions (see for example Matt Knepley's slides from that session). The final session summarized findings and recommendations to be drafted into a jointly authored published statement. The organizers are in the process of creating this "position statement," compiled from the discussions at the workshop and from "thought pieces" contributed by attendees.

We invite members of mloss.org to consider contributing such a thought piece, and hope that the open source community within machine learning will find the thought pieces, slides, or position statement useful in promoting distribution of source code as part of the scientific publication process and promoting reproducible computational science more generally.


Victoria Stodden
Yale Law School, New Haven, CT
Science Commons, Cambridge, MA

Chris Wiggins
Department of Applied Physics and Applied Mathematics,
Columbia University, New York, NY


  • [1] Sonnenburg et al., "The need for open source software in machine learning", Journal of Machine Learning Research, 8:2443-2466, 2007;
  • [2] Social science: Gary King, the Albert J. Weatherhead III University Professor at Harvard University, has documented his efforts in the social sciences at his website. He also runs The Dataverse Network, a repository for social science data and code;
  • [3] Geophysics: Stanford Geophysics Professor Jon Claerbout's efforts in Geoscience;
  • [4] Geophysics: University of Texas at Austin Geosciences Professor Sergey Fomel's open source package for reproducible research, Madagascar;
  • [5] Signal processing: Signal Processing at Ecole Polytechnique Federale de Lausanne, Reproducible Research Repository; including Vandewalle, Patrick and Kovacevic, Jelena and Vetterli, Martin (2009) "Reproducible Research in Signal Processing - What, why, and how", IEEE Signal Processing Magazine, 26(3), pp. 37-47;
  • [6] Databases: The database community tested replication in SIGMOD 2009 submissions; cf. I. Manolescu, L. Afanasiev, A. Arion, J. Dittrich, S. Manegold, N. Polyzotis, K. Schnaitter, P. Senellart, S. Zoupanos, D. Shasha, et al. "The Repeatability Experiment of SIGMOD 2008", SIGMOD Record, 37(1):39, 2008;
  • [7] Databases: R.V. Nehme, "Black Hole in Database Research";
  • [8] Climate: "Please, show us your code", RealClimate, Rasmus E. Benestad;
  • [9] Economics: B.D. McCullough, "Got replicability? The Journal of Money, Credit and Banking Archive", Econ Journal Watch, 4(3):326-337, 2007;
  • [10] Linguistics: T. Pedersen, "Empiricism is not a matter of faith", Computational Linguistics, 34(3):465-470, 2008;
  • [11] Computational Biology: Jill P. Mesirov, "Accessible Reproducible Research", Science, 22 January 2010: Vol. 327, no. 5964, pp. 415-416;
  • [12] General sources on reproducibility;
  • [13] "Bermuda Rules: Community Spirit, With Teeth", Science, 16 February 2001: Vol. 291, no. 5507, p. 1192;
  • [14] Thought pieces contributed by roundtable participants;
  • [15] "Reproducible research and genome scale biology: approaches in Bioconductor", Vincent Carey and Robert Gentleman;
  • [16] "View Source", Chris Wiggins;
  • [17] Agenda for the roundtable, with links to each session's slide decks.

The Open Source Process and Research

Posted by Mikio Braun on January 13, 2010

(Cross-posted.)

I think there is more to be learned from the open source software development process than just publishing the code from your papers. So far, we've mostly focused on making the software side more similar to publishing scientific papers, for example, through creating a special open source software track at JMLR.

However, there is more to be learned from the open source software development process:

  • "Release early, release often" Open source software is not only about making your software available for others to reuse, but it is also about getting in touch with potential users as early as possible, as closely as possible.

Contrast this with the typical publication process in science where there lie months between your first idea, the submission of the paper, its publication, and the reactions through follow-up and response papers.

  • Self-organized collaboration One nice thing about open source software is that you can often find an already sufficiently good solution for some part of your problem. This allows you to focus on the part which is really new. If existing solutions look sufficiently mature and their projects healthy, you might even end up relying on others for part of your project, which is really interesting given that you don't even know these people and have never talked to them. But if the project is healthy, there is a good chance that they will do their best to help you out, because they want to have users for their own project.

Again, contrast this with how you usually work in science, where it's much more common to collaborate with people from your group or people within the same project only. Even if there were someone working on something which would be immensely useful for you, you wouldn't know till months later when their work is finally published. The effect is that there is lots of duplicate work, research results from different groups don't usually interact easily, and much potential for collaboration and synergy is wasted.

While there are certainly reasons why these two areas are different, I think there are ways to make research more interactive and open. And while probably most people aren't willing to switch to open notebook science, I think there are a few things which you can try out right now:

  • Communicate to people through your blog, or by Twitter or Facebook, and let them know what you're working on, even before you have polished and published it. And if you don't feel comfortable disclosing everything, how about some preliminary plots or performance numbers? To see how others are using social networks to communicate about their research, have a look at the machine learning twibe, or my (entirely non-authoritative) list of machine learning twitterers, or lists of machine learning people others have compiled, or another list of machine learning related blogs.

  • Release your software as early as possible, and make use of available infrastructure like blogs, mailing lists, issue trackers, or wikis. There are almost infinitely many options to go about this, either using a hosting site like GitHub, SourceForge, Kenai, Launchpad, or Savannah, or by setting up a private repository, for example using Trac, or just a bare Subversion repository. It doesn't have to be that complicated, though. You can even just put a git repository on your static homepage and have people pull from there. And of course, register your project with mloss, so that others can find it and stay up to date on releases.

  • Turn your research project into a software project to create something others can readily reuse. This means making your software usable for others, interfacing it with existing software, and also starting to reuse existing software yourself. It doesn't have to be large if it's useful. Have a look at mloss for a huge list of already existing machine learning related software projects.

MLOSS ICML 2010 workshop?

Posted by Soeren Sonnenburg on December 16, 2009

We are thinking of organizing an ICML 2010 workshop on machine learning open source software. Does anyone here think this is a great idea like we do? If you would see this happen, please contact us and help us organize it.


US open access policy

Posted by Cheng Soon Ong on December 14, 2009

The Office of Science and Technology Policy of the United States of America is having a public consultation on Public Access Policy, which will run till 7 January 2010. The first part (10-20 December 2009) considers implementation issues, in particular:

  • Who should enact public access policies?
  • How should a public access policy be designed?

The next two sections are (details here):

  • Features and Technology (Dec. 21 to Dec 31)
  • Management (Jan. 1 to Jan. 7)

If you care about how your research is being published, head over and give your views.

Documentation is hard to do

Posted by Cheng Soon Ong on December 4, 2009

There was an article at TechNewsWorld yesterday about the poor state of documentation in Linux. It seems that for most projects, there are two kinds of people: the users and the developers. Users always complain that the documentation is not good enough, and developers don't see the point of writing it. Funnily, once tech-savvy users start digging around in the code a bit, they one day wake up and find that they have crossed the fence, i.e. the project they initially said was badly documented is now one they are actively contributing to. Even worse, they often don't write documentation themselves either.

The Pragmatic Programmer gives two tips about documentation:

  • Treat English as just another programming language
  • Build documentation in, don't bolt it on

Then it goes on to distinguish between internal and external documentation. I think that for machine learning, the external part is really important. Very often, the users of machine learning software are not experts in the field, and "just" downloaded the code to see whether they can solve their problem. In fact, very often, the user is not even familiar with the programming language that the project is implemented in. Each language has its own idiosyncrasies, and projects should try to have at least a README file that tells the user how to get things working: some basic things like how to compile, specific command line operations to get the paths correct, etc. Even interpreted languages can be tricky. For example, Matlab often requires the right set of addpath statements to get things working, and Python requires that $PYTHONPATH be set correctly.
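As a sketch, the sort of minimal quick-start section meant here might look like this (the project layout, build step, and demo script are all made up for illustration):

```
Quick start
-----------
1. Build the extensions:          make
2. Make the package importable:   export PYTHONPATH=$PWD/python:$PYTHONPATH
   (in Matlab instead:            addpath('/path/to/project/matlab'))
3. Run a sanity check:            python examples/demo.py
```

Even a fragment like this spares a non-expert user the archaeology of working out the build and path setup from the source tree.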

It happens quite often that reviewers of JMLR submissions complain of not being able to "get the code working". Sometimes this is due to a deeper problem, but often it is just because the reviewer is not a user of the programming language of the submission. Now, before you criticize me and ask why I don't choose better reviewers; if you take the intersection of machine learning expertise, programming language and operating system, you often end up with only one group of people, namely the ones that submitted the project.

Matlab(tm) 7.3 file format is actually HDF5 and can be read from other languages like Python

Posted by Soeren Sonnenburg on November 19, 2009

It looks like Matlab version 7.3 and later are capable of writing out objects in the so-called Matlab 7.3 file format. While at first glance it looks like another proprietary format, it is in fact the Hierarchical Data Format version 5, or in short HDF5.

So you can do all sorts of neat things:

  1. Let's create some matrix in Matlab first and save it:

    >> x=[[1,2,3];[4,5,6];[7,8,9]]
    x =
     1     2     3
     4     5     6
     7     8     9
    >> save -v7.3 x.mat x
  2. Let's investigate that file from the shell:

    $ h5ls x.mat 
    x                        Dataset {3, 3}
    $ h5dump x.mat 
    HDF5 "x.mat" {
     GROUP "/" {
      DATASET "x" {
        DATASPACE  SIMPLE { ( 3, 3 ) / ( 3, 3 ) }
        DATA {
        (0,0): 1, 4, 7,
        (1,0): 2, 5, 8,
        (2,0): 3, 6, 9
        }
        ATTRIBUTE "MATLAB_class" {
           DATATYPE  H5T_STRING {
                 STRSIZE 6;
                 STRPAD H5T_STR_NULLTERM;
                 CSET H5T_CSET_ASCII;
                 CTYPE H5T_C_S1;
           }
           DATA {
           (0): "double"
           }
        }
      }
     }
    }
  3. And load it from Python:

    >>> import h5py
    >>> import numpy
    >>> f = h5py.File('x.mat')
    >>> x=f["x"]
    >>> x
    <HDF5 dataset "x": shape (3, 3), type "<f8">
    >>> numpy.array(x)
    array([[ 1.,  4.,  7.],
       [ 2.,  5.,  8.],
       [ 3.,  6.,  9.]])

Note that the matrix comes out transposed: Matlab stores arrays column-major, while numpy reads the HDF5 dataset row-major, so you may want to transpose after loading. With that caveat, it actually seems to be a good idea to use Matlab's 7.3 format for interoperability.

How many NIPS papers have source code?

Posted by Cheng Soon Ong on November 12, 2009

With NIPS coming up next month, I'm curious as to how many of the authors would distribute source code corresponding to their NIPS papers. Since the 2009 papers are not yet available, I wrote a small python script to check out the number of papers having http or ftp links in the 2008 batch. The results? 5 papers reported by the script.

  • NIPS2008_1027.pdf
  • NIPS2008_0552.pdf
  • NIPS2008_0117.pdf
  • NIPS2008_0604.pdf
  • NIPS2008_0401.pdf

The search was pretty basic, so I probably detected some false positives, and missed others. Here's the Python script if you want to refine the search. I obtained the papers from the electronic proceedings; warning, the download is 130MB.
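The script itself isn't reproduced here, so the following is a hypothetical reconstruction of the idea: scan each PDF's raw bytes for http/ftp links. Since text streams inside PDFs are usually compressed, this naive approach misses some links and flags spurious ones, consistent with the false positives and negatives mentioned above:

```python
import re
from pathlib import Path

# Match http:// or ftp:// URLs in raw bytes (a crude heuristic).
URL_RE = re.compile(rb'(?:http|ftp)://[^\s<>")]+')

def papers_with_links(directory):
    """Return the names of PDFs in `directory` whose raw bytes
    contain an http or ftp URL."""
    hits = []
    for pdf in sorted(Path(directory).glob('NIPS*.pdf')):
        if URL_RE.search(pdf.read_bytes()):
            hits.append(pdf.name)
    return hits
```

A more faithful search would first extract the text (e.g. with pdftotext) before matching.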

Online Petition

Posted by Soeren Sonnenburg on November 11, 2009

There is an ongoing petition trying to persuade the German parliament (Bundestag) to make all research publications that result from public funding freely available.

A press release is available from the Coalition for Action "Copyright for Education and Research"; unfortunately it is in German.

Consider signing the online petition Wissenschaft und Forschung - Kostenloser Erwerb wissenschaftlicher Publikationen at the German parliament website. Note that the deadline for signatures is December 22 2009.

MLOSS progress updates for November 2009

Posted by Soeren Sonnenburg on November 11, 2009

As of today, mloss.org has

  • 211 software projects with 357 revisions based on
  • 23 programming languages,
  • 370 authors (including software co-authors),
  • 365 registered users,
  • 572 comments (including spam :),
  • 109 forum posts,
  • 51 blog entries,
  • 67 software ratings,
  • 90839 software statistics objects,
  • 143 software subscriptions or bookmarks.

And happy birthday: the site has been live for 2 years and 1.5 months now and is steadily visited by 1200 users per week (November 2009).

And congratulations to Peter Gehler, author of the most successful software project: MPIKmeans (accessed more than 11837 times).

Finally, 10 papers have been accepted to the JMLR-MLOSS track since its announcement in summer 2007.

Visible progress, yes. Nevertheless, does anyone have suggestions on how we could improve (or even want to help out)? I guess we should have another workshop next year, maybe this time not at NIPS but at ICML?

The one thing I would like to see is blog contributions from you. Whenever you stumble across something open source and machine learning related, write any of us an email and we will put your post in this blog.

We are waiting for your ideas: either talk to us at any of the conferences we are attending or leave a comment!

Open Access Week

Posted by Cheng Soon Ong on October 19, 2009

October 19-23 will mark the first international Open Access Week.

Re-implement or reuse?

Posted by Cheng Soon Ong on October 14, 2009

One implicit assumption of open source software is that having the source available encourages software reuse. I'd like to turn the question on its head: "Given a (machine learning) task, should I reuse available code or re-implement it?"

Why reuse

  • Standard software engineering practice Many introductory texts on software engineering teach code reuse as a good thing. This is captured in principles like DRY which says that "Every piece of knowledge must have a single, unambiguous, authoritative representation within a system." However, it refers to software within a project, and what I'm considering here is whether one should reuse code written by someone else (or written by yourself for another project). I think this principle still applies, but one has to be careful about what is exactly meant by "reuse". See Section "Study on OSS".

  • antipattern (reinvent the wheel) In our JMLR position paper, we had two points that argue for software reuse. We get faster scientific progress by reduced cost for re-implementation of methods; and it is better for combining parallel advances into one package. However, as this blog post eloquently puts it, we shouldn't reimplement the wheel, but reinventing the wheel may give us more suitable trade-offs for our particular problem.

  • Better than "hacked up" own version It is seldom tempting to re-implement an eigenvalue decomposition, and most people are quite happy to just use the one in LAPACK. This is because most of us believe that the version in LAPACK will be superior to anything we can write ourselves.

  • Save time by outsourcing maintenance If one reuses a well supported software package with an active development community, then one benefits from each update. Using LAPACK as an example again, it has evolved over the years, and most people take the numerical stability of its eigenvalue decomposition for granted. Ironically, there have been several re-implementations of the eigenvalue decomposition: LAPACK is based on the earlier EISPACK, which in turn is based on a package written by James Wilkinson, originally implemented in ALGOL. This brings us to the first reason to re-implement.

Why re-implement

  • Existing solution is not good enough This, I think, is one reason most people re-implement something. For example, if a method exists, but is not in your favourite programming language, it can often be a pain to use this code. Of course, there are ways to embed code in another language within your own code, or one can choose to use some linking tool like SWIG. Other reasons that you may not like the current solution is because it uses too much memory, takes too long, etc.

  • Educational purposes Many "simple" methods, for example boosting or k-means clustering have probably been re-implemented many times. First, because it is "easy" to implement, and second because it is often used as some sort of initial exercise in machine learning for junior researchers. Going back to the reinventing the wheel principle, this blog gives plenty of reasons to re-implement something. In essence, you should reinvent the wheel if you want to learn about wheels. It reminds me of a comment that Leon Bottou had in one of our NIPS workshops with regards to the new implementation of Torch: "If these young people want to reimplement something, you should support it."

  • Not aware of other solution This ties in to the next point about how much effort it takes to find an existing solution versus the amount of effort it takes to implement the solution yourself. In the study cited below, they argue that if it was easy to find an existing solution, for example through good search tools and powerful indexing (e.g. mloss ;-)), one would be more likely to reuse software. Apparently, many software corporations have programs to encourage reuse of software from both sides of the equation. Making it easier to find relevant code, and also enforcing designs such that existing software is reusable.

  • Existing interface changes too often This has personally happened to me a few times when utilizing a software package that has not really matured yet, and having to spend time rewriting my own code to track changes in the API of another software library. The above argument for software reuse such that we can outsource maintenance is a double edged sword; it also means that you may have to track other projects.

Study on OSS

The following study crystallized some of the ideas that I have. Incidentally, it was done by a bunch of people down the road from where I work. I may have to drop by to have a chat with them at some point.

Stefan Haefliger, Georg von Krogh, Sebastian Spaeth, "Code Reuse in Open Source Software", Management Science, vol. 54. no. 1, pages 180-193, 2008

They have many interesting empirical findings based on an in-depth study of xfce4, TikiWiki, AbiWord, GNUnet, iRATE, and OpenSSL. I'm just pulling out some interesting tidbits:

  • Knowledge reuse vs. software reuse One needs to separate the idea of just copying bits of code from the idea of reading someone else's code and learning something from it. I would argue that knowledge reuse is probably what we really want in machine learning, so even if you think your software is not the cleanest or most efficient implementation, you should still make it open source (and put it on mloss) so that someone else can learn from it. Sadly, as they point out in the paper, knowledge reuse is really hard to measure.

  • Reuse of lines of code vs. reuse of software components It seems that in the study, only a very small proportion (less than 1%) of the 6 million lines of code were copied and accredited. It seems that even though in principle one can copy bits of code, developers rarely copy code from some other project. In contrast, all the projects reused external software components. Here, they detected component reuse by effectively looking for "#include" statements from external projects. It seems that this is the dominant sort of reuse that the open source developers use. Such component reuse include the reuse of methods and algorithms from other tools. Basically, developers prefer to write "interesting" code, and just reuse software to plug the gaps for less interesting parts of the pipeline.
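A crude sketch of that component-reuse detection idea, for C sources (the way headers are classified as the project's "own" here is naive and purely illustrative):

```python
import re
from pathlib import Path

# Match #include directives and capture the header name.
INCLUDE_RE = re.compile(r'^\s*#\s*include\s*[<"]([^>"]+)[>"]', re.M)

def external_includes(src_dir, own_headers):
    """Count #include directives in *.c files under `src_dir` that
    reference headers outside the project's own set."""
    counts = {}
    for f in Path(src_dir).rglob('*.c'):
        for header in INCLUDE_RE.findall(f.read_text(errors='ignore')):
            if header not in own_headers:
                counts[header] = counts.get(header, 0) + 1
    return counts
```

Counting such directives gives a rough picture of which external components a project leans on, which is essentially what the study measured.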

Last thoughts

It makes sense to me that it would be easier to "link" to a particular software component than to encapsulate it in your project (this is probably obvious to C/C++ programmers, who link all the time). Therefore my advice to those publishing software is to think carefully about the interfaces and to document them clearly. Standard software engineering ideas like modular design increase the chances that some other project will reuse yours.

I probably missed out lots of stuff in the lists above, and would like to hear your thoughts...

Open-source Software for Scientific Computation

Posted by Cheng Soon Ong on September 18, 2009

The First International Workshop on Open-source Software for Scientific Computation starts today in Guiyang, China. It is strongly supported by Scilab and the OW2 consortium, and in fact there is a competition for creating toolboxes in Scilab. From the conference website:

"The aim of OSSC 09 is to provide an international forum of exchanging the knowledge of using open-source software for scientific computation within the communities from education, academics, to industries. We expect contribution papers focusing on the development and applications of open-source software for scientific computation."

As announced earlier in a forum post (the original is here), paper submission and revision actually happen after the conference. When I first heard about this, I was totally confused, since I'm so used to machine learning conferences, which have submission deadlines long before the conference. Then I realised that the projects presented at the conference are selected based on abstracts only. On one hand, I think this is quite nice, as the full papers submitted after the conference benefit from the feedback obtained during it. On the other hand, this means that for large communities the conference itself would be enormous. Since open source in scientific computing is still a fledgling field, the increased exposure for young researchers is invaluable, and I think this is the right mode of operation. There has been some discussion in machine learning about how to reduce reviewer load, and perhaps one way to do so is to convert to this mode of operation. I'm not sure what this would mean for conference attendance though. More people attending because they only need to submit abstracts? Or fewer people coming because they do not have an "accepted paper"?

microsoft codeplex

Posted by Cheng Soon Ong on September 15, 2009

Is there any machine learning open source software at CodePlex? Typing "machine learning" into the search interface returns 16 hits, mostly results related to learning in general. Three projects seemed promising. The first two do not have any releases yet, but their source repositories are available.

Machine Learning Framework

The project description on the site says:

Machine Learning Framework (MLF) is a library based on .NET Framework for machine learning implementation. This library consists of collection of machine learning algorithms such as Bayesian, Neural Network, SOM, Genetic Algorithm, SVM, and etc.

Objective This library can help researchers, students, and programmers to build application based on Machine Learning using .NET platform easily.

Fire Ants

The project description says:

FireAnts is an action game similar to Bomberman. It's designed to be a platform for different AI and machine learning techniques.

Microsoft Computational Biology Tools

This is the most mature of the projects. There is a link to a web interface, which seemed very sluggish to me, but as is the trend in many computational biology projects, the computation is hidden behind the browser. See for example the annual Nucleic Acids Research web server issue, which lists 112 projects, or the compiled list of all projects published in NAR.

The source is available under the Microsoft Reciprocal License (Ms-RL).


Unfortunately, it seems that is not open source.

Netflix: part 1

Posted by Cheng Soon Ong on August 10, 2009

As most of you may know, the Netflix prize came to an exciting conclusion recently. It is not yet official which of the top two teams on the leaderboard, The Ensemble or BellKor's Pragmatic Chaos, will win the 1 million dollar prize. The leaderboard shows the results on a public test set, but the grand prize winner will be evaluated on a secret test set by Netflix.

Anyway, I emailed the teams to ask whether they used any machine learning open source software in their prize winning efforts. In general, the feeling I get from the responses is that both teams rolled their own solutions. They were understandably reluctant to share their methods, since the official results are not out yet and Netflix in essence owns the IP.

Greg McAlpin from The Ensemble was kind enough to collect information from his team and provide me with the following summary of open source software that they used. Unfortunately, they also did not want to share their machine learning methods.

Our team decided that it would be best to wait until Netflix officially announces the winner of the competition before we talk about how we used any open source software that is related to machine learning.

We used plenty of open source tools though. Different members of the team used: JAMA/TNT, Mersenne Twister, Ruby, Perl, Python, R, Linux, gcc (and tool chain), gsl, tcl, mysql, openmp, CLAPACK, BLAS, all of the CygWin GNU software

Many members of our team first met on a Drupal website. And personally, I could never have kept track of everything that was going on without TiddlyWiki.

I know that this isn't really what you were asking for. Much of the existing open source software that we were aware of was not able to handle the size of the Netflix Prize data set. I don't think that anyone got Weka or even Octave to work with the data. Some excellent new open source frameworks were created by people competing for the Netflix Prize. It was interesting to me that became the home for many open source projects (instead of sourceforge).

PLoS to publish software

Posted by Cheng Soon Ong on August 8, 2009

In a recent article on GenomeWeb, they said that PLoS may start offering an open source software track in the near future. The new editor will be Robert Murphy, whose lab has published software for image analysis of protein subcellular localization.

Apparently, at the same BioLink SIG at the recent ISMB in Stockholm, they also discussed the publishing of data. Since we are also thinking about how to distribute data, we will be watching developments at PLoS closely. They also discuss how to make papers more machine readable through semantic markup. The example used looks like it took a lot of effort from the publishers, and I wonder whether it is feasible for journals to do this for all their published papers.

A Machine Learning Forum

Posted by Cheng Soon Ong on July 21, 2009

Yoav Freund has started a new discussion forum. It already has quite a few contributors, many of whose names will be familiar to machine learners. At the moment it seems to be mostly a collection of tutorials and introductory articles, but in principle it could become a good place for open discussions.

Chris Drummond suggested that what we really need is a place for open debate. Perhaps a letters section in JMLR. Would the above forum suffice? Or do we need some sort of quality control of the published ideas (hence increasing reviewer load)?

Getting rid of Spam

Posted by Soeren Sonnenburg on July 13, 2009

You might have noticed that mloss.org received increasing amounts of spam starting in June. Well, actually not spam but scam (garbage trying to poison spam filters: random sentences that are very hard to filter out in an automated way). For that reason I have had to disable comments for anonymous users, i.e., users who are not logged in. Sorry for the inconvenience.

Replicability is not Reproducibility: Nor is it Good Science

Posted by Chris Drummond on July 13, 2009

I had promised Soeren that I would post a short version of my argument so we could debate it in this forum. As Cheng Soon kindly points out there is a longer version available.

One compelling argument for repositories such as mloss is reproducibility. Reproducibility of experimental results is seen as a hallmark of science. Collecting all the artifacts used in the production of the experimental results reported in a paper would, it is claimed, guarantee reproducibility. Although not explicitly stated, the subtext is that if we have any pretensions of being scientists then we have little choice but to do this.

My counter argument is that this view is based on a misunderstanding of reproducibility in science. What mloss will allow people to do is replicate experiments, but this is not reproducibility. Reproducibility requires changes; replicability avoids them. Reproducibility's power comes from the differences between an original experiment and its reproduction. The greater the difference the greater the power. One important role of an experiment is to support a scientific hypothesis. The greater the difference of any subsequent experiment the more additional support garnered. Simply replicating an experiment would add nothing, except perhaps to confirm that the original was carried out as reported. To me, this is more of a policing exercise than a scientific one, and therefore, I would claim, of much reduced merit.

GCC + Machine Learning

Posted by Cheng Soon Ong on July 5, 2009

I found a cool project recently which applies machine learning to something that affects most of us who write software. Milepost is a project that uses statistical machine learning to optimize gcc. They point to cTuning for further development. To quote what they hope to do with MILEPOST GCC:

"Next, we plan to use MILEPOST/cTuning technology to enable realistic adaptive parallelization, data partitioning and scheduling for heterogeneous multi-core systems using statistical and machine learning techniques."

There is a lot of infrastructure that needs to be built before coming to the machine learning. In the end, the machine learning question can be stated as follows:

Given M training programs, represented by feature vectors t1,...,tM, the task is to find the best optimization (e.g. compiler flags) for a new program t. In the standard supervised setting, they collect training data for each program ti consisting of optimization (x) and run time (y) pairs. The machine learning question then boils down to finding the parameters theta of a distribution over good solutions, q(x|t,theta), i.e. the right compiler settings (x) for a given program (represented by features t).

However, it seems that they use uniform sampling to search q(x|t,theta) for good solutions, and once they have these islands of good solutions they use 1-nearest neighbour for prediction. There seems to be a lot of scope for improvement on the machine learning side.
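The prediction step described above, picking flags for a new program from its nearest neighbour in feature space, can be sketched in a few lines. This is an illustrative simplification of the MILEPOST setup, not their code; all names and the toy feature vectors are mine.

```python
import numpy as np

def best_flags_for(t_new, train_features, train_best_flags):
    """1-nearest-neighbour prediction: return the best-known flags of the
    training program whose feature vector is closest to t_new."""
    dists = np.linalg.norm(train_features - t_new, axis=1)
    return train_best_flags[int(np.argmin(dists))]

# Toy example: two training programs with already-searched best flags.
train_features = np.array([[0.0, 0.0], [10.0, 10.0]])
train_best_flags = ["-O2", "-O3 -funroll-loops"]
flags = best_flags_for(np.array([1.0, 1.0]), train_features, train_best_flags)
```

The improvement opportunity the post mentions is exactly here: replacing uniform sampling plus 1-NN with a model that actually learns q(x|t,theta).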

Open Science in Machine Learning

Posted by Cheng Soon Ong on June 16, 2009

I am giving an invited talk on mloss at the ICML Workshop on Evaluation Methods in Machine Learning, 2009. I am experimenting with the idea of blogging about my ideas just before giving the talk. Perhaps some of the 167 people who apparently read this blog are at ICML, still on the fence about which workshop to attend, and will come to my talk. But more importantly for me, perhaps some of the people who see my talk will give me written feedback as comments on this blog.

The abstract of the talk is as follows:

Openness and unrestricted information sharing amongst scientists have been identified as values that are critical to scientific progress. Open science for empirical machine learning has three main ingredients: open source software, open access to results and open data. We discuss the current state of open source software in machine learning based on our experience with mloss.org as well as the software track in JMLR. Then we focus our attention on the question of open data and the design of a proposed data repository that is community driven and scalable.

The main theme of the talk is that open science has three main ingredients:

  • Open Access
  • Open Source
  • Open Data

After a brief introduction to open access and open source and why they are a good thing, I will give a (totally biased) historical overview of how mloss has developed: basically, the three workshops, mloss.org, and JMLR. The three main ingredients for open science in machine learning translate to:

  • The paper should describe the method clearly and comprehensively.
  • The software that implements the method and produces the results should be well documented.
  • The data from which the results are obtained is in a standard format.

The argument we have got into time and again is that openness is actually not a requirement for scientific research. Papers do not have to be open access, even though there is evidence showing its benefits. For reproducible experiments, software can be distributed as binary black boxes; of course, one cannot extend software to solve more complex tasks without access to the source code. And data can be held in confidence even after the resulting paper has been published. Ironically, one can publish an open access paper without disclosing the data. We believe that being open is the best way to perform scientific research, and if the evidence does not convince you, you can consider it a moral choice. We envision three independent but interoperable components, the data, the paper, and the software, instead of a monolithic system such as Sweave.

However, one has to be a bit more precise when considering the data component above. Most of the projects currently on mloss.org actually "just" implement an algorithm or present a framework. To obtain a particular result, there are many details which do not fit nicely into the "Let us write a general toolbox for ..." mindset. We believe that a data repository should not only contain datasets, like currently available repositories such as UCI and DELVE. Instead, it should host different kinds of objects:

  • Data: data available in standard formats (containers), with a well defined API for access (semantics).

  • Task: a formal description of input-output relationships, and a method for evaluating predictions.

  • Solution: methods for feature construction, and a protocol for model selection.
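To make the three object types above concrete, here is one way they could be expressed as plain data structures. This is a speculative sketch of the proposed repository's objects, assuming a Python interface; none of these names come from an actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Data:
    name: str
    fmt: str          # container format, e.g. "arff" or "hdf5"
    load: Callable    # the well defined access API (semantics)

@dataclass
class Task:
    name: str
    inputs: List[str]   # input variables
    target: str         # output variable
    evaluate: Callable  # method for scoring predictions

@dataclass
class Solution:
    name: str
    features: Callable      # feature construction
    model_selection: str    # protocol, e.g. "5-fold cross-validation"
```

The point of splitting Data from Task is that one dataset can support many tasks, and one task can be attacked by many solutions, each piece being exchangeable independently.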

The details of the Data part have been strongly influenced by the discussion we have here. The other objects are still not so well defined.

In summary, we think open science benefits the community as a whole. For the individual, it increases visibility and broadens the audience for your problems and solutions. For software, it improves extensibility and usability. However, a data repository is still missing where machine learners can exchange tips and tricks for dealing with real problems. We believe that for machine learning to solve real prediction tasks, we need a common protocol for data communication.

Let us know your comments and suggestions on how to achieve open science.

Open Source in Astronomy

Posted by Cheng Soon Ong on June 4, 2009

It seems that researchers in astronomy have also realised the benefits of open source. A group of scientists have published a manifesto which has the same views as the position paper published by machine learners. From the abstract of the astronomers' statement:

We advocate that:

  1. the astronomical community consider software as an integral and fundable part of facility construction and science programs;
  2. that software release be considered as integral to the open and reproducible scientific process as are publication and data release;
  3. that we adopt technologies and repositories for releasing and collaboration on software that have worked for open-source software;
  4. that we seek structural incentives to make the release of software and related publications easier for scientist-authors;
  5. that we consider new ways of funding the development of grass-roots software;
  6. and that we rethink our values to acknowledge that astronomical software development is not just a technical endeavor, but a fundamental part of our scientific practice.

Now isn't that cool?

Matlab to Python

Posted by Cheng Soon Ong on May 6, 2009

I came across the OMPC (one MATLAB per child) project yesterday. No, it isn't a non-profit organisation handing out Matlab licenses so that kids in developing countries can enjoy programming. It is an automatic translator from Matlab to Python. It tries to preserve the Matlab flavour of the code, so you can carry on writing Matlab-like code in Python. The magic is made possible by Python decorators and introspection.

Python fans out there will throw up their hands in horror ("that's so unpythonic"), but for people who just want to get the job done, it is not bad at all. Also, some useful bits of Matlab syntax, such as the .* operator, may become Python standard in the future.
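To illustrate the syntax gap the post alludes to, here is how Matlab's elementwise .* maps onto numpy. This is plain hand-written numpy for comparison, not OMPC's generated output.

```python
import numpy as np

# Matlab:  C = A .* B   (elementwise product)
# numpy:   C = A * B    (* is already elementwise on arrays)
# Matlab:  C = A * B    (matrix product)
# numpy:   C = A.dot(B)
A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[10.0, 20.0], [30.0, 40.0]])
C_elementwise = A * B
C_matrix = A.dot(B)
```

Because the default meaning of * flips between the two languages, a translator like OMPC has to track operand types to pick the right operation, which is part of why decorators and introspection come into play.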

A simple test with arffload.m from our dataformat project crashed on the first try: Google App Engine didn't like the umlaut in my institution name. After converting Zürich to Zurich, everything worked happily.

What is an "easy to build" system?

Posted by Cheng Soon Ong on April 23, 2009

One thing that reviewers of submissions to the open source track of JMLR often complain about is that the submitted software doesn't build. At this discovery, some reviewers refuse to look at the rest of the submission. I agree that being able to compile a piece of code is an important part of the total score, but it should not be the whole story. In fact, the review criteria for the JMLR OSS track specifically list other important criteria. Being easy to compile would fall under "good user documentation", since it is the end user who benefits from an easy to build system. But in general, once a reviewer is unable to build the submission, he will provide a negatively biased review. Even worse, he may not even consider other parts of the software project.

So, why do reviewers have so much trouble compiling software? The answer is quite complicated, and I would like to scratch the surface of this highly charged issue. More in-depth recommendations for open source projects can be found, for example, in Karl Fogel's online book or Eric Steven Raymond's detailed howto. I restrict this post to Linux-style "download, unzip and build" software, ignoring GUI-style "double click" installations such as .dmg packages or .exe installers.

Documentation, documentation, documentation

Many compilation issues would be solved if there were clear and precise documentation, and the user read this documentation. One JMLR submission had two reviewers who could not build the system, but a third who commented on how smoothly everything went. It turned out that the author had written in his cover letter that the submitted code was incomplete due to file size restrictions on the JMLR website, and reviewers were supposed to get the complete code online.

Apart from documentation for the user to understand what the project does and documentation for the developer on how to extend the project, there are the installation instructions. This includes stuff like how to install, how to upgrade from previous versions, and what dependencies are required. For Linux there are some conventions about how to structure things. If possible, one should stick to one of the standard idioms for compiling software (see the next section). As an aside, Google recently released their software update system.

The build system

The traditional build pipeline is the "configure; make" system popular among C projects such as GNU software. For Python projects there is the setup.py idiom or easy_install. I am not a Java expert, but there seems to be a plethora of build tools available. At the top of my ease-of-installation list comes the R community, which has agreed on a single distribution channel (CRAN). There are also a few up-and-coming build systems such as cmake, scons, waf and jam. If one uses too exotic a build system, the reviewers probably won't have it on their box and will first have to obtain the build system itself. However, one often wants the nicer features provided by the newer systems. Further, JMLR reviewers are often not experts in the language the project is written in and are not familiar with its standard idioms (though this can be fixed by good documentation). It is a tough call...

One thing I've found quite nice is when projects include instructions for checking that your build completed successfully. For machine learning software, this can be a small example on toy data which allows the user to confirm that things are working as they should.
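The post-build check described above might look like the following. This is only a sketch of the idea under the assumption of a numpy-based project: fit a line to noise-free toy data and confirm the recovered slope; a real project would exercise its own entry points instead.

```python
def smoke_test():
    """Tiny sanity check on toy data: recover the slope of y = 2x with a
    least-squares fit. If this fails, the build or its numeric dependencies
    are broken."""
    import numpy as np
    x = np.arange(10, dtype=float)
    y = 2.0 * x
    slope = np.polyfit(x, y, 1)[0]
    assert abs(slope - 2.0) < 1e-6
    return "build OK"
```

Wiring such a check into a `make check` or `python -m yourpackage.selftest` target gives reviewers a one-command way to verify the install.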


Dependencies

Dependencies are a double-edged sword. On one hand, one would like to take advantage of highly optimized libraries such as BLAS, LAPACK, Boost or the GNU Scientific Library. But this often means that you have to track changes in the dependencies, or that the user may not have the dependencies available. We had one JMLR submission which used a combination of Python and C++. One reviewer had a terrible time getting it to work: first because he was not familiar with Python dependencies, and second because his Linux distribution ships the Python headers in a separate package (and he didn't know).


Real life testing

There are all sorts of strange things that can happen while a user is trying to install your software. One should follow one of the common idioms for your language so that the user feels comfortable with the build. But at the end of the day, nothing beats real life testing. So, list your software on mloss.org before you submit to JMLR. It may just let you catch some installation bugs before they upset your reviewers.

Who is allowed to list software?

Posted by Cheng Soon Ong on April 2, 2009

We have been having a quite heated discussion among the organisers of mloss.org about whether we should encourage people to submit open source projects which are not their own.

We wanted (and still want) to follow the model of freshmeat, where we do not actually host any projects at all, but really just provide links to project homepages. However, since a user can upload a tar or zip archive, in principle he or she can do some very "bare bones" hosting just on mloss.org. This is in contrast to a sourceforge or googlecode style project, which gives you all the infrastructure necessary to host an open source project.

So our framework actually allows anybody to submit an mloss project, not only the authors of a package. However, as far as we can tell, only authors have submitted (their own) projects. The question is: why hasn't anyone submitted something that isn't their own? Are people afraid of the competition?

On the other hand, if we start encouraging people to list software they find, will there be problems with quality? Will the original authors of the projects be upset?

MLOSS progress updates

Posted by Cheng Soon Ong on April 2, 2009

Visitor statistics

Looking at our access statistics, we see a very nice periodic curve with peak accesses on Tuesday, at around 200 users per day. The slowest days? The weekends, with around 100 users per day. So it seems that people come back to work on Monday and get their weekly fix of mloss.org. The peak falls on Tuesday because our site logs in CET and many of our accesses come from across the Atlantic. The USA leads the visitor count, but people from Croatia and Denmark view the most pages (more than 6 on average).

In the last month we've had 172 (123 unique) visitors to this blog, which is more than I expected. I had assumed that people use mloss.org as a place to find software and to update their own, but it seems that some people actually read this blog. :-) However, it is clear that most people just come for the software. It is quite hard to tell exactly how many of our visitors are real people and how many are web crawlers. A rough guess is that at least half the visitors to mloss.org are machines, since they spend less than 10 seconds on the site.

Machine Learning Data

The discussion about a format for machine learning data seems to have ground to a halt. Do machine learners really not care about exchanging data automatically? Let us know your thoughts!

Bioinformatics tools

Posted by Cheng Soon Ong on March 20, 2009

Thanks to high throughput methods for measuring biological systems and well developed databases for making the data publicly available, bioinformatics is faced with the problem of too many disparate sources of information. Also, the nature of biological research is such that it is not possible to ignore the big picture, so a researcher is likely to need access to the different data sources. Recently, two large projects were announced more or less simultaneously, both claiming to provide unifying tools for investigating bioinformatics data.

The first, Unison, aims to be a comprehensive warehouse for all things related to protein sequences. It already seems quite developed, with links to many large sources of protein data such as GO, NCBI, SCOP and PDB. One thing I found quite nice is a tool called "BabelFish" which translates between the different naming conventions for proteins. This means that one can match the proteins referred to in different databases and leverage the information much more easily. The other interesting thing is that they also consider predictions to be part of the "data". While predictions are treated as second class citizens in the world of bioinformatics, they are usually necessary for poorly studied problems, or problems where measurements are expensive or take a long time. From a machine learning viewpoint, this is definitely a good thing to see. The site gives a warning when predictions are returned.

Warning: These features are from computational predictions, not experimental data. 
Although we filter features based on score or probability to improve specificity, 
the accuracy of these predictions is largely unknown and varies by method and sequence.

What is even nicer from a machine learning point of view is that all the predictions are displayed on the same plot, so one can objectively compare the predictions from various tools for the protein of interest. Furthermore, when the experimental verification is available in future, the tools can be compared objectively.

The second tool that was announced is Sage, which currently has a very bare website. However, it is worth a mention here because it is based on internal work from Merck/Rosetta, and hence may provide an integrated environment for studying disease. In an interview with BioInform, Eric Schadt claims that even structural data will be made available. The target launch date is 1 July 2009.

Major updates to mloss.org

Posted by Soeren Sonnenburg on February 15, 2009

We have been working hard to improve the mloss experience for you.

To this end, we have implemented

  • software revisions
  • spam filters
  • an email author field

and fixed many small issues in the code. As usual, the code of the webpage is available, and in this case it also serves as an example of what has changed.

Tell us what you think.

Which programming language on mloss.org?

Posted by Cheng Soon Ong on December 22, 2008

The winner is, surprisingly (for me at least), C++. And here we all thought that machine learners could only program in Matlab. Speaking of Matlab, why are there so many projects with Matlab sources, but so few with Octave?

The runner-up, R, is mostly obtained automatically from the CRAN servers. Assuming that this is somehow at steady state and that all languages have equal numbers of supporters, we should expect all the major languages to have roughly 40 projects each. Rounding out the top are C, Python and Java.

Here are the numbers for the site:

  • C:37
  • C++:49
  • D:1
  • Erlang:1
  • Java:22
  • lisp:2
  • Lua:1
  • Matlab:37
  • Octave:4
  • Perl:4
  • Python:34
  • R:40
  • Ruby:2

MLOSS Workshop Videos online

Posted by Mikio Braun on December 22, 2008

The videos for our MLOSS workshop at Whistler are online. The discussions were also recorded on video. At around 19:50 in the video on reproducibility, you can see me trying very hard to stay focused while not staring at the camera.

Happy Holidays, everyone!

On our NIPS workshop

Posted by Mikio Braun on December 19, 2008

On December 12, our third workshop on machine learning open source software was held in Whistler, BC, Canada. It featured two invited speakers, a host of new and exciting software projects on machine learning, and two interesting discussions where we tried to initiate new developments.

We were very glad to get two speakers from very prestigious and successful projects: John W. Eaton from Octave, a Matlab clone, and John D. Hunter from matplotlib, a Matlab-style plotting library for Python.

John W. Eaton gave valuable insights into his experience of running an open source project. Started in 1989 as companion software to a book on chemical reactions, Octave's main intention was to give students something more accessible than Fortran. Only afterwards did people realize that Octave was very close to Matlab, and over the years they requested better and better compatibility with Matlab. The last major release, version 3, brought even fuller compatibility, with support for sparse matrices and a complete overhaul of the plotting functionality. Still, Octave is looking for help, in particular in the areas of documentation, mailing list maintenance, and packaging. So if you're interested, drop John a line.

Matplotlib, by John D. Hunter, also started as a private project. John worked on epilepsy research in neurophysiology and initially wrote what would become matplotlib to display brain waves together with related data. At some point, matplotlib had become so big that it practically required all of his time. By now, John works in the finance industry but has an agreement with his employer to spend a fraction of his time on matplotlib. We also learned that matplotlib contains a full re-implementation of Donald Knuth's TeX algorithms for rendering annotations in plots.

Both speakers stressed the importance of being resilient and pointed out that they both had to wait some time (even years) before their projects really took off. Both also shared their insights on how difficult it can be to deal with users. On the one hand, you have to be reliable to build up trust in your project; on the other hand, there are always some users who expect full support basically for free and are unwilling to contribute.

Besides those two invited talks, we again had a number of interesting projects. The submissions this year could roughly be classified into full frameworks, projects which focus on a special type of application or algorithm, and infrastructure.

We had four different projects which provide a full-blown environment for doing machine learning and statistical data analysis. The first talk was on Torch, a full-blown Matlab replacement written in a combination of Lua and C++. Torch is optimized for efficiency and large scale learning and comes with its own matrix classes (called tensors) and plotting routines. Shark is a similarly feature-rich framework written in C++. For users of R, there is kernlab, which focuses on kernel methods. Finally, Python was represented by mlpy and mdp, the latter sporting an innovative module architecture which allows plugging together data processing modules. It was very interesting to see that so many different projects with such broad scope exist, and also to learn that these projects weren't much aware of one another.

Projects with a more focused scope included Nieme, which contains algorithms for energy-based learning, libDAI, a library of inference algorithms for graphical models with discrete state spaces, and Model Monitor, a tool for assessing the amount of distribution shift in the data and the sensitivity of algorithms under distribution shift. The BCPy project provides a Python layer over the BCI2000 system and allows working with the latter in a much more flexible manner.

Finally, we had projects which dealt with different aspects of infrastructure. The RL Glue project provides a general framework to connect environments and learners in a reinforcement learning setting. This project has been highly successful, and is the standard platform for a number of challenges in this area. Disco implements the map-reduce framework for distributed computing in a particularly elegant manner for Python users, based on a core written in Erlang. The Experiment Databases for Machine Learning and BenchMarking Via Weka projects address the issue of benchmarking machine learning algorithms in an automatic and reproducible way, providing a database to describe models and experimental results.

In summary, it seems that researchers are quite active in providing feature-rich, high-quality open source software for machine learning. The 23 submissions to this workshop alone provide evidence for that. At the same time, it seems that most projects are still unaware of each other. In particular, when it comes to interoperability, a lot is still missing, making it hard to combine algorithms written in different languages, or code developed for different frameworks.

Therefore, one of the discussions focused on the question of interoperability. As a starting point, we proposed the ARFF file format as a common format for exchanging data; such a format could serve as an important first step. Leaving more complex solutions like remote method invocation or CORBA aside, a common data format is really the simplest way to exchange data between two pieces of code which might be written in different languages or run on different platforms. As we expected, the discussion was quite lively, as the number of possible data formats is large, and the different features you could want are not always compatible. But I think what we achieved was to raise awareness of the need for interoperability. Hopefully, people will start to think about how their code could interact with other code, and standards will emerge over time.
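To make the idea concrete, here is a minimal sketch of writing and reading the ARFF format discussed above. The format is hand-rolled here purely for illustration; a real project would use an existing ARFF library for its language, and this sketch ignores ARFF features like comments, quoting and sparse data.

```python
# A minimal, illustrative ARFF writer and reader (not a full parser).

def write_arff(relation, attributes, rows):
    """Serialize rows to an ARFF string. attributes: list of (name, type)."""
    lines = ["@RELATION %s" % relation]
    for name, atype in attributes:
        lines.append("@ATTRIBUTE %s %s" % (name, atype))
    lines.append("@DATA")
    for row in rows:
        lines.append(",".join(str(v) for v in row))
    return "\n".join(lines)

def read_arff(text):
    """Parse the ARFF string back into (relation, attributes, rows)."""
    relation, attributes, rows = None, [], []
    in_data = False
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        upper = line.upper()
        if upper.startswith("@RELATION"):
            relation = line.split(None, 1)[1]
        elif upper.startswith("@ATTRIBUTE"):
            _, name, atype = line.split(None, 2)
            attributes.append((name, atype))
        elif upper.startswith("@DATA"):
            in_data = True
        elif in_data:
            rows.append(line.split(","))
    return relation, attributes, rows

text = write_arff("iris_sample",
                  [("sepal_length", "NUMERIC"), ("class", "{setosa,versicolor}")],
                  [[5.1, "setosa"], [7.0, "versicolor"]])
relation, attrs, rows = read_arff(text)
```

The appeal of such a format is exactly that the reader needs no knowledge of the language or toolbox that produced the file.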

The other discussion addressed an even more difficult question, namely that of reproducibility. How can we ensure that somebody else can reproduce the experimental results from a machine learning paper? An interesting suggestion was to require that the software producing the results be provided on a bootable live CD, like an Ubuntu install CD, to really make sure that the environment in which the experiments were done can be set up easily. The question also arose whether you want to be able to reproduce the results only at publication time, or even after ten years. Again, there is the problem of how to describe and store results in a database. Here too, we did not arrive at a conclusion, but hopefully overall awareness was raised.

Overall, I think the workshop was very successful and interesting. There is always room for improvement, of course. For example, we should make sure not to forget to schedule coffee breaks next time. Also, I think we should put more emphasis on the community-building aspect and less on individual projects. In 2006, the topic was so new that people didn't know what kinds of projects were out there, but now, partly thanks to this website, the existence of open source software for machine learning is much better known. So giving projects a platform to advertise their software is certainly an important part, but thinking about what the next step is, and talking about how to integrate what we already have, is something I would put more emphasis on next time.

Again I (and Soeren and Cheng as well) would like to thank everybody who contributed to this workshop, and of course also the Pascal2 framework for their financial support.

Help needed

Posted by Cheng Soon Ong on December 8, 2008

NIPS is a lot like Christmas. You travel to some place once a year, meet people (some you like, others you don't), and eat a lot.

To the point of this entry. If there are any budding Django programmers out there who would like to help out with the development of mloss.org, please come and talk to us. We will have a t-shirt handing-out table again at NIPS, so please come by and have a chat with myself, Soeren or Mikio.

MLOSS progress updates for November 2008

Posted by Soeren Sonnenburg on November 27, 2008

Two months have passed since the last statistics update, so let's see if we are progressing:

As of today, mloss.org has

  • 158 software projects based on
  • 19 programming languages,
  • 302 authors (including software co-authors),
  • 284 registered users,
  • 63 comments (including spam :),
  • 109 forum posts,
  • 28 blog entries,
  • 51 software ratings,
  • 31525 software statistics objects,
  • 143 software subscriptions or bookmarks.

And happy birthday: the site has been live for 1 year and 1.5 months now, and since we recently became a target of spammers, it seems that mloss.org is not that unimportant anymore. This is also documented by a traffic growth from around 300 visits per week (February 2008) to more than 1000 per week (November 2008).

And congratulations Peter Gehler, author of the most successful software project: MPIKmeans (accessed more than 6000 times).

Finally JMLR-MLOSS received

  • 20 submissions until now,
  • 5 resubmissions,
  • 3 already accepted and published,
  • 1 pending publication

since its announcement in summer 2007.

One may conclude that there is visible progress. However, as already pointed out in several previous blog posts, we mostly see isolated mloss projects that don't interoperate with each other at all. It is clear that this trend needs to change, but how can we support the next steps? If you have some bright ideas, either talk to us at NIPS*08 (possibly even attend the workshop and present your ideas in the discussion) or leave a comment...

Open Source is not Interoperability

Posted by Cheng Soon Ong on November 26, 2008

Trying to prepare some thoughts about interoperability to be discussed at the NIPS workshop, I came across a bunch of websites roughly in the following order:

  • a rather negative article about the state of open source software and how it interoperates,

  • a very positive blog at ugotrade which talks about OpenSim and how it will be the next hot thing.

  • a post about why open source and interoperability are really two different things.

Quoting the third author:

  1. Interop is not open source.
  2. Interop does not require open source implementations
  3. Open source does not guarantee Interop

While one might think that it is natural for open source developers to make use of other bits of (open source) software, it usually doesn't happen. For me, interoperability can occur in two ways: the first is having a common set of protocols (as argued for in the third post above), and the second is integrating another software library or method. In some sense, the "integration" idea also requires a set of protocols or APIs. It may be that I'm just being pedantic in trying to semantically differentiate between protocols and APIs. But the main idea remains: we need software that talks to other bits of software.

However, if both pieces of software are open source, we can do more than just have software that talks to other bits of software (which is why OpenSim is raising so much interest). In the process of having to push together two software projects, we may be able to come up with better interfaces between them. This is especially true in the research area (which in some sense practices carpentry), where it is not clear from the start how programs should interact. For supervised machine learning, datasets are a good place to start. It seems "obvious" that this is one place where different machine learning algorithms can interface with each other, yet even for this "simple" interface there is a multitude of data formats and standards. Another quite fruitful area is convex optimization, where there are several projects (even here on mloss.org) which easily link to different back ends, or several solvers which are used by various front ends. Interestingly, here the interfaces are actually dictated by the mathematics, and the software implementations are just mirroring these forms. I think it is within our reach to have these kinds of interoperability for many other areas of machine learning.

As for the long term goal of software systems being well integrated in the application specific fashion, I think we still have a way to go yet...

mloss08 Program

Posted by Cheng Soon Ong on November 6, 2008

Just in case you haven't checked our workshop page recently, we have finalised our program. We had a surprisingly large number of submissions, ranging from quite mature projects to small radical ideas. In the end, we decided that we should try to squeeze in as many projects as possible, and at the same time try to keep some diversity in the program; i.e. we didn't want to have all slots taken up by large mature machine learning frameworks.

Our theme this year is "interoperability, interoperability, interoperability". The dream is to have some way for machine learning software to talk to each other. We are still a long way from being able to plug and play different tools for machine learning, and we hope to make a start by discussing this at the workshop. Of course, machine learning research is not only about software, but it is also about the data. Our afternoon discussion session will be about "UCI 2.0", and how we should go about it. There was a recent editorial in Nature Cell Biology about the need for standardizing bioinformatics data, and this blog post highlights three properties of scientific data.

Hope to see you at NIPS!

Reviewing software

Posted by Cheng Soon Ong on October 13, 2008

The review process for the current NIPS workshop mloss08 is now underway. There are a couple of interesting thoughts that I had while discussing this process with Soeren and Mikio, as well as some of the program committee. The two issues are:

  • Who should review a project?
  • What are the review criteria?

Reviewer Choice

Unlike for standard machine learning papers, a reviewer for an mloss project has to be comfortable with three different aspects of the system, namely:

  • The machine learning problem (e.g. Graphical models, kernel methods, or reinforcement learning)
  • The programming language, or at least the paradigm (e.g. object oriented programming)
  • The operating environment (which may be a particular species of make on a version of Linux)

There are also projects about a particular application area of machine learning, such as brain-computer interfaces, which place an additional requirement on the understanding of the reviewer.

However, if one looks at the set of people who satisfy all those criteria for a particular project, one usually ends up with only a handful of potential researchers, most of whom would have a conflict of interest with the submitted project. So, often I would choose a reviewer who is an expert in one of the three areas and hope that he or she would be able to figure out the rest. Is there a better solution?

Review Criteria

The JMLR review criteria are:

  1. The quality of the four page description.
  2. The novelty and breadth of the contribution.
  3. The clarity of design.
  4. The freedom of the code (lack of dependence on proprietary software).
  5. The breadth of platforms it can be used on (should include an open-source operating system).
  6. The quality of the user documentation (should enable new users to quickly apply the software to other problems, including a tutorial and several non-trivial examples of how the software can be used).
  7. The quality of the developer documentation (should enable easy modification and extension of the software, provide an API reference, provide unit testing routines).
  8. The quality of comparison to previous (if any) related implementations, with respect to run time, memory requirements, and features, to demonstrate that significant progress has been made.

This year's workshop has the theme of interoperability and cooperation. Therefore, that is also a review criterion. The important question is how to weight the different aspects. The answer is not at all clear. There is a basic level of adherence which is necessary for each of the criteria, above which it is difficult to trade off the different aspects quantitatively. For example, does very good user documentation excuse very poor code design? Does being able to run on many different operating systems excuse very poor run-time memory and computational performance?

Put your comments below or come to this year's workshop and discuss this!

GNU Octave on Free Software Foundation's High Priority List

Posted by Soeren Sonnenburg on October 6, 2008

The Free Software Foundation (FSF) maintains a high-priority list of software projects, which can be found here.

Quoting the FSF:

The FSF high-priority projects list serves to foster the development of projects that are important for increasing the adoption and use of free software and free software operating systems. [...] Some of the most important projects on our list are replacement projects. These projects are important because they address areas where users are continually being seduced into using non-free software by the lack of an adequate free replacement.

Ranked eighth among the top ten prioritized software projects is GNU Octave, a free software Matlab replacement.

As this is very relevant to our community, which is strongly dominated by Matlab, I would like to encourage everyone to try out Octave 3.0. If you tried Octave 2.x or any earlier version at some point: it has really matured a lot. It supports all the data types you know from Matlab, like cell arrays and dense or sparse arrays, and yes, it has all the plotting functions like plot, surf3d etc. too. And if you ever tried to extend Matlab using C code, support is really much better on the Octave side, not to mention the killer feature: Octave is fully supported by SWIG! Still not convinced? We will have John W. Eaton introduce Octave to us at the NIPS'08 MLOSS Workshop. So what are you waiting for? Give Octave a try and see how you can help!

Differences between paid and volunteer FOSS contributors

Posted by Soeren Sonnenburg on October 3, 2008

I just stumbled across a very interesting article titled Differences between paid and volunteer FOSS contributors. The original article was written by Martin Michlmayr and can be found here; an almost full quote follows:

There's a lot of debate these days about the impact of the increasing number of paid developers in FOSS communities that started as volunteer efforts and still have significant numbers of volunteers. Evangelia Berdou's PhD thesis "Managing the Bazaar: Commercialization and peripheral participation in mature, community-led Free/Open source software projects" contains a wealth of information and insights about this topic.

Berdou conducted interviews with members of the GNOME and KDE projects. She found that paid developers are often identified with the core developer group which is responsible for key infrastructure and often make a large number of commits. Furthermore, she suggested that the groups may have different priorities: "whereas [paid] developers focus on technical excellence, peripheral contributors are more interested in access and practical use".

Based on these interviews, she formulated the following hypotheses which she subsequently analyzed in more detail:

  1. Paid developers are more likely to contribute to critical parts of the code base.
  2. Paid developers are more likely to maintain critical parts of the code base.
  3. Volunteer contributors are more likely to participate in aspects of the project that are geared towards the end-user.
  4. Programmers and peripheral contributors are not likely to participate equally in major community events.

Berdou found all hypotheses to be true for GNOME, but only hypotheses two and four were confirmed for KDE.

In the case of GNOME, Berdou found that hired developers contribute to the most critical parts of the project, that they maintained most modules in core areas, and that they maintained a larger number of modules than volunteers. Two important differences were found in KDE: paid developers attend more conferences and they maintain more modules.

Berdou's research contains a number of important insights:

  • Corporate contributions are important because paid developers contribute a lot of changes, and they maintain core modules and code.
  • While it's clear that the involvement of paid contributors is influenced by the strategy of their company, Berdou wonders whether another reason why they often contribute to core code is because they "develop their technical skills and their understanding of the code base to a greater extent than volunteers who usually contribute in their free time". It's therefore important that projects provide good documentation and other help so volunteers can get up to speed quickly.
  • Since many volunteers cannot afford to attend community events, projects should provide travel funds. This is something I see more and more: for example, Debian funds some developers to attend Debian conference and the Linux Foundation has a grant program to allow developers to attend events.
  • Paid developers often maintain modules they are not paid to directly contribute to. A reason for this is that they continue to maintain modules in their spare time when their company tells them to work on other parts of the code.

The rest of the article can be found here.

Deadline extension mloss 08

Posted by Cheng Soon Ong on September 30, 2008

Murphy's law has struck us. After happily running for more than a year, the hardware running mloss.org is facing some strange difficulties the day before our deadline for mloss 08. So, if you cannot submit, don't panic.

So, to be fair we've decided to extend the deadline to next Monday.

Final Call for Contributions: NIPS*08 MLOSS Workshop

Posted by Soeren Sonnenburg on September 25, 2008

This is the final call for contributions for the NIPS*08 MLOSS workshop to be held on Friday, December 12th, 2008 in Whistler, British Columbia, Canada.

The deadline for submissions is approaching quickly: just one week remains until October 1, 2008. We accept all kinds of machine learning (related) software submissions for the workshop. If accepted, you will be given a chance to present your software at the workshop, which is a great opportunity to make your piece of software better known to the NIPS audience and to receive valuable feedback.

We have decided to use mloss.org for managing the submissions. You basically just have to register your project on mloss.org and tag it with nips2008. For more information, have a look at the workshop page.

Data sources

Posted by Cheng Soon Ong on September 19, 2008

For people interested in algorithm development, we are often faced with the "have a hammer, looking for a nail" problem. Once we have confirmed that the standard machine learning datasets (for example at UCI) do not offer a useful application area, where does one go? Below, I look at four websites which list data and also software associated with data. The information is not collected with machine learning in mind, so a user would probably need to write preprocessing scripts to convert things into something useful.

A common theme is that just providing blobs of data isn't enough; one has to provide data as well as interfaces or processing tools for it. The other common theme is that these are just listings of data, not archival copies.


This is a site for large data sets and the people who love them: 
the scrapers and crawlers who collect them, 
the academics and geeks who process them, 
the designers and artists who visualize them. 
It's a place where they can exchange tips and tricks, 
develop and share tools together, and begin to integrate their particular projects. classifies the activities that people want to do to data into three different ones: get, process, view. In the get section, they provide a list of links to sources of data, which includes things from US congressional district boundaries to stock ticker data which requires a (free) registration. Unfortunately, the list of datasets is a static list, and does not provide useful slicing capabilities. In the view section, there is a nice list of different visualizations of datasets, for example a visualization of trends in twitter or worldmapper which morphs the area of a country to correspond to the size of a certain variable of interest, such as the number of internet users.

However, the really nice thing about this site is that each section lists tools of the trade and tips and tricks: bits of software related to collecting, processing and visualizing data. These are the kinds of things which simplify our data analysis tasks. There doesn't seem to be a tool for each of the data sources listed yet, which means that a machine learner may still need to write their own scraping tool to get data.


There are many sources to find out something about everything. 
Until now, there’s been no good place for you to find out everything about something.

This site is still in beta, and currently only provides a list of datasets. They promise to allow uploading of your own datasets in the full version. What's nice about the design is that you can slice the list of datasets according to a list of predefined fields or tags. So, in a sense, the design relies, much like mloss.org, on community involvement to keep the repository fresh and up to date. Most of the data seems to be in tabular format (csv, xls), but they support yaml, which means that in principle more complex structures can exist.
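The tag-based slicing described above is simple to picture in code. Here is a toy sketch: given a listing of dataset records, filter it down by tag or by format. The records, tags and field names are invented for illustration.

```python
# Toy dataset listing; each record carries a format field and a tag set.
datasets = [
    {"name": "iris", "format": "csv", "tags": {"classification", "small"}},
    {"name": "netflix", "format": "csv", "tags": {"recommendation", "large"}},
    {"name": "citeseer", "format": "yaml", "tags": {"citations", "graph"}},
]

def slice_by(records, tag=None, fmt=None):
    """Return the records matching the given tag and/or format."""
    result = []
    for rec in records:
        if tag is not None and tag not in rec["tags"]:
            continue
        if fmt is not None and rec["format"] != fmt:
            continue
        result.append(rec)
    return result

csv_sets = slice_by(datasets, fmt="csv")      # tabular datasets only
graph_sets = slice_by(datasets, tag="graph")  # datasets tagged "graph"
```

The same predicate-based filtering works whether the listing lives in memory, a database, or behind a web API.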

They provide the Infinite Monkeywrench, a scripting language to process data.

(the site seems to be having some problems recently, possibly due to the imminent v1.0)


Datamob highlights the connection between public data sources 
and the interfaces people are building for them

They list hot new datasets and hot new interfaces, which are the latest listings. They have a short list of machine learning data which includes the venerable UCI and also Netflix. There is a simple submit form which allows one to add a link to a source of data or an interface. They don't aim to be comprehensive, but rather to be the best place to see how public data is being put to use online. However, it is a pity that the two lists seem to be independent. It would be nice to see which datasets use which interfaces.

Looking at one of the visualizations (under interfaces) of the 2008 presidential donations, it pointed out something interesting: often when visualizing data, there are not enough pixels on a screen to represent what you want.


Those familiar with freshmeat, CPAN or PyPI 
can think of CKAN as providing an analogous service for open knowledge.

They package data in a predefined format, which allows them to design an API. In particular, they encourage open data: material that people are free to use, reuse and redistribute without restriction. The predefined package allows them to attach much more metadata to each submission, and in the long run would allow more automated processing. For example, they allow the download of the metadata of CiteSeer, which is Dublin Core compliant with additional metadata fields, including citation relationships (References and IsReferencedBy), author affiliations, and author addresses.

The REST API essentially defines how client software can upload and download data, and allows querying of what resources are available.
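To illustrate what client code against such a REST API looks like, here is a hedged sketch: build the resource URL for a package and parse the JSON metadata a server might return. The base URL, endpoint path and metadata fields are invented for illustration and do not reflect CKAN's exact schema.

```python
import json
from urllib.parse import urljoin

# Hypothetical API root; a real client would use the actual service URL.
BASE = "http://ckan.example.org/api/rest/"

def package_url(name):
    """Build the resource URL for a named data package."""
    return urljoin(BASE, "package/" + name)

def parse_package(body):
    """Extract the package name and resource formats from a JSON response."""
    meta = json.loads(body)
    return meta["name"], [r["format"] for r in meta["resources"]]

# A made-up response body standing in for what a server might send back.
sample_response = json.dumps({
    "name": "citeseer",
    "license": "open",
    "resources": [{"format": "dublin-core",
                   "url": "http://example.org/citeseer.xml"}],
})

name, formats = parse_package(sample_response)
```

The point of the predefined package format is exactly this: once the metadata schema is fixed, clients in any language can be written against it.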

NIPS Workshop 2008 accepted

Posted by Mikio Braun on September 8, 2008

We are glad to announce that our workshop at this year's NIPS conference has been accepted! We are tentatively scheduled for Friday, December 12th, 2008. The workshop will be held in Whistler, British Columbia, Canada.

We accept software submissions for the workshop. The deadline for submissions is October 1, 2008. If accepted, you can present your software to the workshop audience, which is a great opportunity to make your piece of software better known to the NIPS audience.

We have decided to use mloss.org for managing the submissions. You basically just have to register your project on mloss.org and tag it with nips2008. For more information, have a look at the workshop page.

New JMLR-MLOSS publication and progress updates for September 2008

Posted by Soeren Sonnenburg on September 3, 2008

Again almost two months have passed since the last progress report. Well as Cheng already posted, we finally took the time and made a slightly polished version of the source code available.

And the usual statistics follow: mloss.org now has 235 registered users and 129 software projects.

Finally, the mloss project liblinear, a library to train linear SVMs in very little time, got accepted at JMLR, and we again highlight the software by interlinking it with the JMLR publication.

Software Freedom Law Center on GPL compliance

Posted by Mikio Braun on August 22, 2008

The Software Freedom Law Center has posted a guide on how to ensure that you do not violate the GNU Public License when using GPL'd software in your project. ArsTechnica also has a few comments.

The guide might also come in very handy if your legal department is eager to learn more about the implications of using open source software.

Wuala, social online storage

Posted by Cheng Soon Ong on August 15, 2008

There was a small party last night to celebrate the beta launch of Wuala, the latest in a long line of online storage services. The idea of online storage is compelling: no need to synchronise all your different computers, somebody else takes care of your backup, and it is easy to share data with others. However, the reality is that there is no free lunch, and for most people the cost of online storage is prohibitive. There are several free services (for example the list here), but in general, you cannot just upload everything to the cloud and throw away your hard drive.

Wuala lets you store anything -- photos, videos, your latest paper -- for free, with no bandwidth or file size limits. What's the catch? You have to contribute storage, megabyte for megabyte, to the service. You get 1GB free to start with, but for any extra space that you need, you have to plug in your own hard drive and offer it for them to add to the cloud. So you basically convert your hard drive from a private one-person device into a shared device holding bits of data from everyone. Like GFS, Wuala creates redundant copies of data and distributes them on commodity hardware; in Wuala's case, the commodity hardware is your hard drive and the data bus is the internet. When users transfer data to and from Wuala, they push and pull, P2P style, from the hard drives of all the members.
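The redundancy scheme described above can be sketched in a few lines: split a file into chunks and place k replicas of each chunk on distinct peers. The peer names and parameters here are invented for illustration, and this is plain replication; a real system would use more sophisticated redundancy such as erasure coding.

```python
import hashlib

def chunk(data, size):
    """Split a byte string into fixed-size chunks (last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def place_replicas(chunks, peers, k=2):
    """Map each chunk index to k distinct peers, chosen by hashing the chunk."""
    placement = {}
    for idx, blob in enumerate(chunks):
        h = int(hashlib.sha1(blob).hexdigest(), 16)
        start = h % len(peers)
        # k consecutive peers starting from the hashed position are distinct
        # as long as k <= len(peers).
        placement[idx] = [peers[(start + j) % len(peers)] for j in range(k)]
    return placement

peers = ["alice", "bob", "carol", "dave"]
pieces = chunk(b"some file contents to back up", size=8)
placement = place_replicas(pieces, peers, k=2)
```

Reads can then pull each chunk from whichever replica peer happens to be online, which is what makes the P2P push-and-pull style workable.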

There are two ways to access Wuala: via a web browser and via an application that runs on your computer. The Linux version of the application effectively requires the user to have root access to their box, since it calls for an fstab entry. So, for those Linux users in academic environments with centralized admins, this makes life difficult. The web browser interface uses Java. Their website was a bit slow this morning when I tried it, so be patient with them.

Personally, for storage and backup, I think there are better ways to do it (e.g. buying an external hard drive, cloning my current laptop drive and leaving the external disk with a good friend that I meet regularly). However, if you are sharing data among collaborators, this seems like a wonderful thing to have. Each member of the team contributes some amount of disk space and bandwidth, and Voilà!

Walking the walk

Posted by Cheng Soon Ong on August 14, 2008

We have made the source of mloss.org available.

This site is based on Django, and we have borrowed several components from other open source projects. We hope that by making the source of this site open, we can benefit other communities who also want to build a similar type of site. If you do build a site which lists open source software, and you have some projects which could be of interest to the machine learning community, please let us know. We would love to be able to regularly (automatically) update our site from external sources like what we are currently doing with CRAN (see the earlier blog).

Also, some personal communication from a disgruntled new user convinced us that we should have our forum more clearly located. So, now we have added a new tab to our navigation bar. Hopefully we will have a more lively forum now that it is not "hidden away".

Finally, one plea to those budding Python programmers out there who believe in the cause: please join the team.

To those wondering where the headline comes from:

Interoperability and the Curse of Polyglotism

Posted by Mikio Braun on August 12, 2008

It seems that this homepage is steadily growing. We already have a large number of registered projects covering many different applications and machine learning methods. Time to think about where we're heading with all of this.

I think one of the first goals of this whole endeavor is that you can easily find software for methods published elsewhere. Whether you're interested in comparing your own method against an existing one, or you actually want to apply a method to some real data, being able to find and download the software is a huge improvement over having to re-implement the method based on the paper.

However, I think that ultimately it would be great if some form of interoperability evolved between different software packages which address the same problem. In particular in a field like machine learning, where the number of (abstract) problems is relatively small and there exist many competing methods for a given problem (for example, two-class classification on vectorial data), being able to replace one of these methods easily with another one would be very useful.

The way to achieve this is, as everywhere else in industry, to develop standards. Actually, there are many different levels at which such standards could be defined, ranging from web services through binary APIs to data file formats.

A few weeks ago, I advocated the use of modern scripting languages like Python or Ruby for developing new machine learning toolboxes, but with respect to interoperability, this "polyglotism" creates some new problems. Back in the "old days", when people were mostly using compiled languages, making your software usable for others was a matter of creating a library which could then be linked against new programs. Differences in calling conventions aside, this approach was relatively flexible; for example, you could use a Fortran library in C or a C library in C++.

But if you use a scripting language like Python, you can use that library only in Python. You cannot link your C program against the Python module, or import the module in another language like Ruby. If you want to re-use a Python library in another language, you have to invest in some more infrastructure.

The hard way would be to set up a language-agnostic interface to your Python code, for example by creating a web service, or by using a protocol like CORBA.

The low-cost version would be to settle on a common data format. Then you can, in principle, combine methods from different environments by storing intermediate results in files. It won't be fast, but it will work.

To support this approach, we started a discussion some time ago, in which we settled on the ARFF format as a possible starting point. Furthermore, we have started to write and/or compile code for reading and writing ARFF files in a large number of programming languages, so that you do not have to implement the file format yourself.
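The file-based pipeline described above might look like this minimal sketch: one "tool" dumps intermediate results to a common on-disk format (plain CSV here for brevity; the ARFF readers and writers mentioned above would slot in the same way), and a second tool, possibly written in a different language, picks them up. The column names and stages are invented for illustration.

```python
import csv
import os
import tempfile

def tool_a_write(path, rows):
    """First stage: write (id, score) pairs produced by some learner."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "score"])
        writer.writerows(rows)

def tool_b_read(path):
    """Second stage: read the scores back; the writer's language is irrelevant."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        return [(row["id"], float(row["score"])) for row in reader]

path = os.path.join(tempfile.mkdtemp(), "scores.csv")
tool_a_write(path, [("a", 0.9), ("b", 0.1)])
scores = tool_b_read(path)
```

Slow, as noted, but the two stages share nothing except the file format, which is the whole point.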

Django 1.0

Posted by Cheng Soon Ong on August 6, 2008

The framework that the site is based on, Django, is now approaching version 1.0. So far, we have been using the SVN version of Django.

So, of course, we are planning to move to Django version 1.0 when it becomes available, and, depending on how much time we have, maybe even track the betas. To all those silent users out there: please let us know if you find anything strange or wrong with the site.

Final Call for Comments: MLOSS NIPS*08 Workshop

Posted by Soeren Sonnenburg on July 11, 2008

This is a final call for comments regarding our NIPS'08 MLOSS Workshop proposal, which we will send to the NIPS workshop organizers next Thursday (July 17).

As mentioned before, we managed to secure a number of high-profile invited speakers: John W. Eaton, the author of Octave, and John D. Hunter, the author of matplotlib.

Apart from this, we decided to hold a discussion session in the morning and one in the afternoon, covering the following topics:

What is a good mloss project?

  • Review criteria for JMLR mloss
  • Interoperable software
  • Test suites

Reproducible research

  • Data exchange standards
  • Should datasets be open too? How do we provide access to datasets?
  • Reproducible research: the next level after UCI datasets

Finally, we invite authors of mloss software to present their projects. This time, submission will be done in a radically new way. To submit:

  • Tag your project with the tag nips2008
  • Ensure that you have a good description (limited to 500 words)
  • Any bells and whistles can be put on your own project page, and of course provide this link on the site

We very much invite feedback and are looking for active co-organizers too!

New JMLR-MLOSS publication and progress updates for July 2008

Posted by Soeren Sonnenburg on July 7, 2008

Almost two months have passed since the last progress report. The biggest news is the recent pulling in of R machine learning packages. This led to 35 additional projects, and we are now at 120 projects and 224 registered users.

We also made a lot of progress regarding the upcoming NIPS'08 MLOSS Workshop proposal and managed to secure a number of high-profile invited speakers (John W. Eaton, the author of Octave, and John D. Hunter, the author of matplotlib), as well as the program committee. In case you have suggestions, let us know! Otherwise, we will submit the proposal in the next weeks. Although we planned to have t-shirts at ICML'08, it remained unclear whether a table would be reserved for us to distribute them. We therefore decided to postpone the t-shirts to NIPS'08. After all, it makes a lot of sense to distribute them there in case we get the workshop accepted :-).

Finally, the SHARK C++ Machine Learning Library got accepted in JMLR. We again highlight the software by interlinking it with the JMLR publication. Note that SHARK, in contrast to LWPR, is the first full-fledged toolbox (implementing more than just a single algorithm) to be accepted.

Interlinking with the R Machine Learning Community

Posted by Soeren Sonnenburg on June 24, 2008

When it comes to scientific computing, one of the best organized and most experienced open source communities is the R community. Already a long time ago, they managed to develop a free alternative to S. Nowadays, the R community offers a wide variety of well-categorized packages. We are proud to announce that, with the help of Torsten Hothorn, Kurt Hornik and Achim Zeileis, we are now automagically listing packages from the R-cran machine learning section.

10 from 133

Posted by Cheng Soon Ong on June 17, 2008

There was a paper at the beginning of this year by Budden et al. (2008A), who looked at double-blind review and claimed that it increases the proportion of accepted papers with female first authors. Soon after, Webb et al. responded that the trend actually holds for other (non-double-blind) journals too. Recently, Budden et al. (2008B) reanalysed the data and rebutted the rebuttal.

The blog article at Sandwalk looks at this issue in more detail.

But here at mloss, we have no review process (yet), and there is no bias against women. Or is there? Out of the 133 author names listed on the site, my guess is that 10 are women. Mind you, my ability to judge whether a name belongs to a guy or a girl is not 100% accurate, but I think the estimate is pretty good. Where are all the women who write mloss? The fact remains that less than 10% of the authors of projects that appear on mloss are women. What do you think? Why is this the case?


  • Budden, A.E., Tregenza, T., Aarssen, L., Koricheva, J., Leimu, R. and Lortie, C.J. (2008A) Women, Science and Writing. Trends in Ecology & Evolution, 23(1), 4-6.
  • Budden, A.E., Lortie, C.J., Tregenza, T., Aarssen, L., Koricheva, J. and Leimu, R. (2008B) Response to Webb et al.: Double-blind review: accept with minor revisions. Trends in Ecology & Evolution.
  • Webb, T.J., O'Hara, B. and Freckleton, R.P. (2008) Does double-blind review benefit female authors? Trends in Ecology & Evolution.

Data repositories

Posted by Cheng Soon Ong on June 12, 2008

I read an interesting blog post at Science in the Open about the problems its author has with institutional repositories.

"But the key thing is that all of this should be done automatically and must not require intervention by the author. Nothing drives me up the wall more than having to put the same set of data into two subtly different systems more than once."

I think this is not limited to institutional repositories; it is true for repositories in general. While web forms are nice, it is extremely irritating for a researcher to manually upload data and fill out the same information more than once. The question is: how do we automate the distribution of data and metadata once it has been entered manually somewhere?

Maybe this is all a pipe dream, but would it not be possible to reconstruct metadata from the way the data is used and accessed (say, based on web links)? Of course, if we are trawling the web and slurping up data, how do we know what is open access and what is not? One of the comments on the blog above mentioned RoMEO, which is a list of open access journals. Would this also work for open data? From the same people (EPrints, which incidentally powers the PASCAL network's eprints), we get some examples of how other repositories can be built, for example data repositories.

NIPS*08 Deadline Fever

Posted by Soeren Sonnenburg on June 7, 2008

Does this picture look familiar to you?

NIPS Server Load

Well, it is over now, but why are people always last minute (and I am obviously no exception)? Which reminds me: for the planned NIPS workshop, wouldn't it be a good idea to use the site as the submission system? That is, instead of receiving emails from people, ask contributors to announce their project here, including a reasonable description, and to set a nips08 tag. And hey, who knows whether the site is capable of dealing with that load :-)

Style checking in python

Posted by Cheng Soon Ong on June 5, 2008

Python is an interpreted language, and hence some bugs only get caught at run time. I recently had a discussion about how irritating it is that programs crash due to errors which could easily be caught at "compile time". My view is that compilation should be transparent to the programmer, and one should still be able to catch all these silly errors while coding. Of course, there is already a lot of work in programming languages on this. Paradoxically, most of the concepts were developed for compiled languages.

From wikipedia:

"In computer programming, lint was the name originally given to a particular program that flagged suspicious and non-portable constructs (likely to be bugs) in C language source code. The term is now applied generically to tools that flag suspicious usage in software written in any computer language. The term lint-like behavior is sometimes applied to the process of flagging suspicious language usage. lint-like tools generally perform static analysis of source code."
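As a toy illustration of what such lint-like static analysis does (this is a hand-rolled sketch, not one of the projects below), python's standard `ast` module can flag a suspicious construct such as comparing to `None` with `==`, without ever running the code:

```python
import ast

def find_none_comparisons(source):
    """Return the line numbers where `== None` or `!= None` appears,
    instead of the idiomatic `is None` / `is not None`."""
    suspicious = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Compare):
            for op, right in zip(node.ops, node.comparators):
                if (isinstance(op, (ast.Eq, ast.NotEq))
                        and isinstance(right, ast.Constant)
                        and right.value is None):
                    suspicious.append(node.lineno)
    return suspicious

print(find_none_comparisons("x = f()\nif x == None:\n    pass\n"))  # [2]
```

Real checkers go much further, of course, but the principle is the same: static analysis of the parsed source, with no execution involved.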

For python, I found three projects which seemed well supported:

Does anyone know of other style checkers? Are there any user experiences out there?

Some thoughts on Machine Learning Toolboxes

Posted by Mikio Braun on May 24, 2008

One popular format for an open source project in machine learning seems to be the creation of a complete toolbox, providing most of what you need for your everyday machine learning work. Often, such projects are not consciously started but just evolve out of the environment one constructs for oneself. Which is good, as it ensures that included features actually work and are relevant.

Often, such toolboxes use one of the new scripting languages, for example python together with a scientific toolbox like scipy. PyML and Monte Python are two examples which can be found on the site. The alternatives would be either something like matlab, which already contains an enormous amount of support for numerical computations, or a compiled language like Java, in which you can build almost everything.

Actually, I think that using a scripting language like python is a huge step in the right direction. All the raw computing power supplied by matlab aside, the programming language used in matlab is already a bit rusty. Well, I know that matlab provides some support for object oriented programming, but the one-file-per-function rule really breaks down when you start assembling objects. Pass-everything-by-value is also quite a headache. (As I was just visiting their website, it seems that they have cleaned up their OOP stuff a bit. But since I haven't checked too closely, and for the sake of the argument, let's just pretend they haven't :).)

So, yes, scripting languages are great because we finally have much more powerful tools for modelling the computational processes which we work with. And an often overlooked fact (from the viewpoint of a toolbox designer) is that machine learning is not just about analyzing data, but also about developing new methods. Interestingly, we apply the same statistical methodology to evaluate methods which we also use to analyze raw data: for example, we assess the methods on resamples of the data when using cross-validation, and we apply statistical tests to see whether a method performs significantly better than the state-of-the-art.

In other words, machine learning research actually closes the loop between data and analyzer in the sense that the methods which we use to analyze data become the object of study themselves (and therefore also the target of statistical analysis). What this means for the underlying programming language is that it must have the capacity to treat methods as objects themselves.

The programming language beneath matlab basically goes as far as allowing function handles (if we forget the OOP part which has been bolted onto the language), but machine learning methods have much more structure than a function which takes some arguments. For example, most methods have additional parameters which have to be tuned to achieve good performance. Only if you can talk about a method and its parameters in a natural fashion can you start to write something like a truly generic cross-validation method, or a function which takes a bunch of methods and a data set and computes the table of numbers which allows us to compare the methods (and write papers).

All of this is simple in an object oriented language like python. We can implement methods as objects, not just as a collection of train/predict functions, and provide methods for querying all the interesting additional information, and then write methods which work with other methods. (Okay, I admit that this might also be possible in matlab. I have had ideas about how to parse the initial comment in a matlab script file to extract this kind of information, but let's just not go there... .)
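To make this concrete, here is a minimal sketch of the idea, with entirely made-up class and function names (this is not the API of any existing toolbox): a method is an object exposing train/predict together with its tunable parameters, and a generic cross-validation routine works with any such object and returns plain data:

```python
import random

class NearestMeanClassifier:
    """A toy method-as-object; `params` declares its tunable
    parameters so that generic routines can inspect them."""
    params = {}  # this toy method happens to have none

    def train(self, X, y):
        # per-class mean of the (one-dimensional) inputs
        self.means = {c: sum(x for x, l in zip(X, y) if l == c) / y.count(c)
                      for c in set(y)}

    def predict(self, X):
        return [min(self.means, key=lambda c: abs(x - self.means[c]))
                for x in X]

def cross_validate(method, X, y, folds=5):
    """Generic over any object with train/predict; returns an
    ordinary list of per-fold error rates, not a special result object."""
    idx = list(range(len(X)))
    random.Random(0).shuffle(idx)
    errors = []
    for f in range(folds):
        test = idx[f::folds]
        held_out = set(test)
        train_idx = [i for i in idx if i not in held_out]
        method.train([X[i] for i in train_idx], [y[i] for i in train_idx])
        predictions = method.predict([X[i] for i in test])
        errors.append(sum(p != y[i] for p, i in zip(predictions, test))
                      / len(test))
    return errors
```

Because the routine relies only on the train/predict interface, swapping in another method object requires no changes, and the returned list of numbers can be fed straight back into the same analysis machinery.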

I personally think scripting languages are also better prepared for building this kind of abstract framework than statically typed languages like Java. The reason is that the flexibility of the type system (or the complete lack thereof) allows us to build frameworks which are quite flexible and work with all kinds of objects, as long as they provide the right interface. In Java (and I'm just taking it as an example), you would have to build explicit interface hierarchies, which easily results in elaborate class hierarchies containing literally hundreds of classes. In a loosely typed language, you can keep much of this stuff implicit, which has the huge benefit that much less boilerplate is required to actually use the framework, write new methods, and have the framework interact with your code.

One last point before I take a look at the current state: an important consequence is that method-related functions like cross-validation, or other kinds of evaluation procedures, should return an ordinary data set: not a special object containing the results, but really the same kind of data structure you use to analyze your usual data. These two steps, methods as objects and storing the result of a method's assessment again in a data set, truly close the loop and will turn a data analysis toolbox into a machine learning research toolbox.

So what is the current state of affairs? From a quick glance at tutorials and documentation, it seems that most toolboxes are still at the stage where they try to pull in as many machine learning methods as possible and provide mechanisms for building elaborate data analysis schemes from them. Which is perfectly fine with me; the whole framework described above would be pretty useless without any data analysis methods.

But there are also signs that people are starting to "unlearn" their matlab training and take advantage of the OOP modelling power. Just to name an example, the PyML toolbox provides generic assessment routines which take a classifier object and perform all kinds of data analysis steps on the method! However, the resulting data is put into a data structure which differs from the "normal" data structure. Aside from this minor restriction, though, this is the direction I'd like to see more of!

Proposal for NIPS*08 workshop

Posted by Soeren Sonnenburg on May 22, 2008

We are planning to have a NIPS workshop again this year. After our workshop proposal did not get accepted last year, we thought it would be a good idea to publicly discuss our current proposal. We very much invite feedback and are looking for active co-organizers too!

Some thoughts on Open Data

Posted by Cheng Soon Ong on May 15, 2008

At the end of last year, Science Commons announced the Protocol for Implementing Open Access Data which concerns the interoperability of scientific data. bbgm has summarized this into 10 points, of which I would like to focus on the first. Quoting bbgm:

Given the amount of legacy data, it is unlikely 
that a single license will work for scientific data. 
Therefore, the memo focuses on principles for 
open access data and a protocol for 
implementing those principles.

Is licensing appropriate for scientific data?

The first knee-jerk reaction is to say "Of course! It will protect different people's interests." However, as John Wilbanks points out, data available in the public domain cannot be made "more free" by licensing, only less. Quoting him:

The public domain is not an “unlicensed commons”. 
The public domain does not equal the BSD. 
It is not a licensing option.
It is the natural legal state of data.

There are several other opinions here and here, but at the end of the day, it is clear that open data is highly important for scientific research, and possibly even more important than open source. My personal view is that for machine learning, public domain seems to be the best for our data.

Taking this idea of "public domain" to the area of software, one can ask whether all academic software should be open source. I had the pleasure of spending a few days last week talking to Neil Lawrence and Carl Rasmussen. Neil seems to have software for each paper he has recently submitted available on a group webpage. Carl is one of the many people who have contributed to the Gaussian Processes website. The listed projects would be considered (I guess) public domain, or "freely available for academic use". Does it matter that these really useful pieces of software do not have explicit licensing? Should they consider some form of license?

LWPR is the first application that made it into JMLR-MLOSS

Posted by Soeren Sonnenburg on May 9, 2008

The Library for Locally Weighted Projection Regression, or LWPR for short, got accepted in JMLR. We would like to thank the authors for their effort, and we will start to interlink and highlight accepted JMLR submissions.

MLOSS progress updates for May 2008

Posted by Cheng Soon Ong on May 8, 2008

Here is a bit of self-advertising, and a development in the bioinformatics community...

We have, as of today, 68 software projects and 205 registered users on the site. What surprised me is the breadth of languages that machine learners write their software in. A look at the list of languages revealed that most of the popular languages are represented in our list of mloss projects:

  • C, C++
  • clisp, java
  • matlab, octave
  • python, perl
  • R, ruby

Comparing with the most popular programming languages on TIOBE, notable missing languages include:

  • visual basic, php
  • c#, d, delphi
  • javascript

One can argue that many of these languages are more suited to web development than machine learning code, but C# and Delphi are general-purpose languages. Maybe the fact that they are strongly linked with Microsoft has scared open source developers away from those languages.

In a discussion post, I pointed out that the International Society for Computational Biology was finalizing a policy statement about software sharing, in which it recommends open source software. The relevant section says:

III. Implementation when software sharing is warranted

  1. In most cases, it is preferable to make source code available. We recommend executable versions of the software should be made available for research use to individuals at academic institutions.
  2. Open source licenses are one effective way to share software.
    For more information, see the definition of open source, and example licenses, at

For the bioinformatics community, this means that researchers can more easily justify to the powers that be that open source is the right way to go. Will the machine learning community follow?

Does the machine learning community need a data interchange format?

Posted by Cheng Soon Ong on March 16, 2008

While in principle a standardized data format seems like a good thing, in practice everyone seems to write their own little data parser. It seems that machine learning researchers find it too troublesome to agree on a standard format. For simple tabular-style data, delimited ASCII-based formats may offer the right tradeoff between human readability and efficiency.

For example, the UCI Machine Learning Repository uses simple comma-separated values, with one example per row. Additional information, such as which column contains the label, is given in another file.


An alternative to this is a sparse ASCII format, where each entry is an (index, value) pair. LIBSVM uses such a format:

1 1:0.68421 2:-0.616601 3:0.144385 4:-0.484536 5:0.23913
1 1:0.142105 2:-0.588933 3:-0.165775 4:-0.938144 5:-0.347826
1 1:0.121053 2:-0.359684 3:0.40107 4:-0.175258 5:-0.326087
1 1:0.757895 2:-0.521739 3:0.219251 4:-0.360825 5:-0.0652174
1 1:0.163158 2:-0.268775 3:0.614973 4:0.0721649 5:0.0434783
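Parsing this sparse format is straightforward; as a hand-written sketch in python (this is not LIBSVM's own reader):

```python
def parse_sparse_line(line):
    """Parse one line of the LIBSVM-style sparse format into
    a (label, {index: value}) pair."""
    fields = line.split()
    label = float(fields[0])
    features = {int(i): float(v)
                for i, v in (item.split(":") for item in fields[1:])}
    return label, features

print(parse_sparse_line("1 1:0.68421 2:-0.616601 3:0.144385"))
# (1.0, {1: 0.68421, 2: -0.616601, 3: 0.144385})
```

Writing such a one-off parser takes minutes, which is perhaps exactly why no common format has taken hold.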

Another possibility is, instead of just a delimited file, to have an additional 'header' section where metadata is defined. Weka uses the so-called ARFF format, which has a header section before the tabular data begins. Interestingly, there does not seem to be a formal definition of this data format; instead, Weka defines the format via a set of examples. Recently, an ANTLR grammar along with a python implementation of the corresponding lexer/parser has appeared.
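For illustration, a minimal ARFF file might look roughly like this (the relation and attribute names here are made up):

```
@relation toy
@attribute height numeric
@attribute weight numeric
@attribute class {positive, negative}
@data
1.72, 68.0, positive
1.55, 51.2, negative
```

The header declares each column's name and type (numeric, or a nominal set of values), so the tabular data that follows is self-describing.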

Lastly, for those who find ASCII too inefficient, there is HDF5, which claims to be highly scalable.

However, the original question remains: do we need to agree on one format, and if so, what should it be? Join our discussion!