Open Thoughts

June 2009 archive

Open Science in Machine Learning

June 16, 2009

I am giving an invited talk on mloss at the ICML Workshop on Evaluation Methods in Machine Learning, 2009. I am experimenting with the idea of writing a blog post about my ideas just before giving the talk. Perhaps some of the 167 people who apparently read this blog are at ICML, are still on the fence about which workshop to attend, and will come to my talk. But more importantly for me, perhaps some of the people who see my talk can give me written feedback as comments on this post.

The abstract of the talk is as follows:

Openness and unrestricted information sharing amongst scientists have been identified as values that are critical to scientific progress. Open science for empirical machine learning has three main ingredients: open source software, open access to results and open data. We discuss the current state of open source software in machine learning based on our experience with mloss.org as well as the software track in JMLR. Then we focus our attention on the question of open data and the design of a proposed data repository that is community driven and scalable.

The main theme of the talk is that open science has three main ingredients:

  • Open Access
  • Open Source
  • Open Data

After a brief introduction to open access and open source and how nice they both are, I will give a (totally biased) historical overview of how mloss has developed: basically, the three workshops, mloss.org, and the software track in JMLR. The three main ingredients for open science in machine learning translate to:

  • The paper should describe the method clearly and comprehensively.
  • The software that implements the method and produces the results should be well documented.
  • The data from which the results are obtained should be in a standard format.

The argument we have got into time and again is that openness is actually not a requirement for scientific research. Papers do not have to be open access, even though there is evidence showing the benefits of open access. For reproducible experiments, software can be distributed as binary black boxes. Of course, one cannot extend software to solve more complex tasks without access to the source code. And data can be held in confidence even after the resulting paper has been published. Ironically, one can publish an open access paper without disclosing the data. We believe that being open is the best way to perform scientific research, and if the evidence does not convince you, you can consider it a moral choice. We envision three independent but interoperable components (the data, the paper, and the software) instead of a monolithic system such as Sweave.

However, one has to be a bit more precise when considering the data blob above. Most of the projects currently on mloss.org actually "just" implement an algorithm or present a framework. To obtain a particular result, there are many details which do not fit nicely into the "Let us write a general toolbox for ..." mindset. We believe that a data repository should not contain only datasets, as currently available repositories such as UCI and DELVE do. Instead, it should host different objects:

  • Data: Data available in standard formats (Containers). Well-defined API for access (Semantics).

  • Task: Formal description of input-output relationships. Method for evaluating predictions.

  • Solution: Methods for feature construction. Protocol for model selection.

The details of the Data part have been strongly influenced by the discussion we have here. The other objects are still not so well defined.
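To make the three objects a little more concrete, here is a minimal Python sketch of how they might fit together. This is only an illustration under my own assumptions, not the design of the proposed repository: the class layout, the field names, and the helpers `accuracy` and `fit_majority` are all hypothetical, and a plain fitting function stands in for a real model selection protocol.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Data:
    """Hypothetical container: a dataset in one standard in-memory format."""
    name: str
    features: Sequence[Sequence[float]]
    labels: Sequence[int]

    def __len__(self) -> int:
        return len(self.labels)


@dataclass
class Task:
    """Hypothetical task: which examples to learn from, which to predict,
    and the method used to evaluate the predictions."""
    data: Data
    train_idx: Sequence[int]
    test_idx: Sequence[int]
    evaluate: Callable[[Sequence[int], Sequence[int]], float]


@dataclass
class Solution:
    """Hypothetical solution: feature construction plus a fitting procedure
    (a stand-in for a full model selection protocol)."""
    featurize: Callable[[Sequence[float]], Sequence[float]]
    fit: Callable[[list, list], Callable[[Sequence[float]], int]]

    def run(self, task: Task) -> float:
        d = task.data
        x_train = [self.featurize(d.features[i]) for i in task.train_idx]
        y_train = [d.labels[i] for i in task.train_idx]
        predict = self.fit(x_train, y_train)
        y_pred = [predict(self.featurize(d.features[i])) for i in task.test_idx]
        y_true = [d.labels[i] for i in task.test_idx]
        return task.evaluate(y_true, y_pred)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def fit_majority(x_train, y_train):
    """Trivial baseline: always predict the most frequent training label."""
    majority = max(set(y_train), key=y_train.count)
    return lambda x: majority


toy = Data("toy", [[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 0])
task = Task(toy, train_idx=[0, 1, 3], test_idx=[2, 3], evaluate=accuracy)
baseline = Solution(featurize=lambda x: x, fit=fit_majority)
score = baseline.run(task)
```

The point of keeping the three objects separate is that the same Data can back several Tasks, and the same Task can be used to compare several Solutions with an identical evaluation method.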

In summary, we think open science benefits the community as a whole. For the individual, it increases visibility and broadens the audience for your problems and solutions. For software, it improves extensibility and usability. However, what is still missing is a data repository where machine learners can exchange tips and tricks for dealing with real problems. We believe that for machine learning to solve real prediction tasks, we need to have a common protocol for data communication.

Let us know your comments and suggestions on how to achieve open science.

Open Source in Astronomy

June 4, 2009

It seems that researchers in astronomy have also realised the benefits of open source. A group of scientists has published a manifesto which takes the same view as the position paper published by machine learners. From the abstract of the astronomers' statement:

We advocate that:

  1. the astronomical community consider software as an integral and fundable part of facility construction and science programs;
  2. that software release be considered as integral to the open and reproducible scientific process as are publication and data release;
  3. that we adopt technologies and repositories for releasing and collaboration on software that have worked for open-source software;
  4. that we seek structural incentives to make the release of software and related publications easier for scientist-authors;
  5. that we consider new ways of funding the development of grass-roots software;
  6. and that we rethink our values to acknowledge that astronomical software development is not just a technical endeavor, but a fundamental part of our scientific practice.

Now isn't that cool?