Open Science in Machine Learning

Posted by Cheng Soon Ong on June 16, 2009

I am giving an invited talk on mloss at the ICML Workshop on Evaluation Methods in Machine Learning, 2009. I am experimenting with the idea of writing a blog about my ideas just before giving the talk. Perhaps some of the 167 people who apparently read this blog, and who are at ICML and still on the fence about which workshop to attend, will come to my talk. But more importantly for me, perhaps some of the people who see my talk can give me written feedback as comments to this blog.

The abstract of the talk is as follows:

Openness and unrestricted information sharing amongst scientists have been identified as values that are critical to scientific progress. Open science for empirical machine learning has three main ingredients: open source software, open access to results and open data. We discuss the current state of open source software in machine learning based on our experience with mloss.org as well as the software track in JMLR. Then we focus our attention on the question of open data and the design of a proposed data repository that is community driven and scalable.

The main theme of the talk is that open science has three main ingredients:

Open Access
Open Source
Open Data

After a brief introduction to open access and open source and how it is very nice, I will give a (totally biased) historical overview of how mloss has developed. Basically, this covers the three workshops, mloss.org, and JMLR. The three main ingredients for open science in machine learning translate to:

The paper should describe the method clearly and comprehensively.
The software that implements the method and produces the results should be well documented.
The data from which the results are obtained is in a standard format.

The argument we have got into time and again is that openness is actually not a requirement for scientific research. Papers do not have to be open access, even though there is evidence showing its benefits. For reproducible experiments, software can be distributed as binary black boxes. Of course, one cannot extend software to solve more complex tasks without access to the source code. And data can be held in confidence even after the resulting paper has been published. Ironically, one can publish an open access paper without disclosing the data. We believe that being open is the best way to perform scientific research, and if the evidence does not convince you, you can consider it a moral choice. We envision three independent but interoperable components: the data, the paper, and the software, instead of a monolithic system such as sweave.

However, one has to be a bit more precise when considering the data blob above. Most of the projects currently on mloss.org actually "just" implement an algorithm or present a framework. To obtain a particular result, there are many details which do not fit nicely into the "Let us write a general toolbox for ..." mindset. We believe that a data repository should not only contain datasets like currently available repositories such as UCI and DELVE. Instead, it should host different objects:

Data Data available in standard formats (Containers). Well defined API for access (Semantics).
Task Formal description of input-output relationships. Method for evaluating predictions.
Solution Methods for feature construction. Protocol for model selection.

The details of the Data part have been strongly influenced by the discussion we have here. The other objects are still not so well defined.

In summary, we think open science benefits the community as a whole. For the individual, it increases visibility and broadens the audience for your problems and solutions. For software, it improves extendability and usability. However, a data repository where machine learners can exchange tips and tricks for dealing with real problems is missing. We believe that for machine learning to solve real prediction tasks, we need to have a common protocol for data communication.

Let us know your comments and suggestions on how to achieve open science.

Comments

Josh Reich (on June 19, 2009, 09:08:39)

I'm a great fan of the site and share very similar goals to the three you mentioned with the little web service I launched earlier this week - http://predict.i2pi.com/ - and I just put up a blog post explaining my motivation at http://blog.i2pi.com/.

While developing this I spent some time investigating PMML. It seems very heavy handed for the flexibility that it is trying to achieve and appears to be currently implemented only in closed source software. But I'm curious to hear your thoughts.

Olivier Grisel (on June 21, 2009, 23:47:38)

In the same spirit, it would be great if the mloss project could set up and maintain an up to date benchmark evaluation of the latest versions of the common open source implementations of machine learning algorithms.

The goal would be to mimic the great language shootout hosted by the debian project that aims at comparing compiler, interpreter and VM implementations: http://shootout.alioth.debian.org/

The great machine learning shootout could start by comparing open source classifiers and regressors (SVM, neural network libraries, classification trees, ...) on common open access datasets using CPU, memory and time bounds.

The CPU and memory bounds could be defined by the smallest instance available on Amazon EC2.
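A minimal sketch of how such bounds might be checked for a single run, assuming a Python implementation. The names `run_bounded` and `mean_predictor` are hypothetical, and `tracemalloc` only sees memory allocated through Python, so native libraries would need a different measurement.

```python
import time
import tracemalloc


def run_bounded(fit, data, max_seconds, max_bytes):
    """Run fit(data) and report whether it stayed within the given bounds.

    The bounds are checked after the run finishes; nothing is interrupted.
    """
    tracemalloc.start()
    start = time.perf_counter()
    result = fit(data)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {
        "result": result,
        "seconds": elapsed,
        "peak_bytes": peak,
        "within_bounds": elapsed <= max_seconds and peak <= max_bytes,
    }


def mean_predictor(values):
    # Stand-in "learner" (hypothetical): predicts the mean of the targets.
    return sum(values) / len(values)


report = run_bounded(mean_predictor, [1.0, 2.0, 3.0],
                     max_seconds=5.0, max_bytes=10**8)
```

A real shootout would more likely enforce the limits from outside the process, so that a run exceeding them can be stopped rather than merely flagged.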

Patrik Hoyer (on June 24, 2009, 12:19:49)

Hi everyone,

Thanks Cheng for your post to start this thread.

For the benefit of those following this discussion, let me start by providing a little bit of context for what is to follow: The Pascal2 Network of Excellence (funded by the EU) is looking to fund the construction of a dataset/benchmark repository for the machine learning community. An initial call was issued in December 2008, and proposals were due at the end of January 2009. The decision was to suggest to combine the proposal submitted by the Helsinki team (Hoyer, Lahtinen, Tonteri, Myllymaki, Ukkonen, and Mannila) with the proposal submitted by the mloss.org team (Mikio Braun, Soeren Sonnenburg, Cheng Soon Ong).

Thus, our current task is to discuss the various options and produce a joint plan, combining the strong points of the two proposals. Given that the goal is to produce a service promoting open machine learning, Cheng's suggestion to have an open discussion here on mloss.org (inviting comments and opinions from the community) is obviously a good one.

For concreteness, I provide here the original proposal of the Helsinki team. (Needless to say, the proposed timetable and the budget are no longer directly relevant as the project will not be carried out in its present form, but the general ideas are still valid and many if not most would fit well in a joint proposal. Please feel free to provide comments, preferably directly on this forum but alternatively by email.) Perhaps it might be a good idea if the mloss.org team's original proposal were available here as well?

Some of the main ideas of our Helsinki-based team were:

Challenges/tasks are precisely specified: i.e. exactly what is the input, what is the required format of the output, and how is the output scored (that is, precisely defining what is a good result)?
The evaluation of the results is performed on the server, but researchers run their own machine learning algorithms (regression, classification, unsupervised learning, etc) on their own platforms/computers.
Community-driven: Similarly to wikipedia, all content of the system (including the categories and other structure) can be entered and maintained/updated by all members of the community (though some form of login would be required)
Full version history of all the material (this is pretty much essential if all content can be updated/maintained by anyone)
All content on the system easily downloadable by anyone, to make sure that the data and all material is 'free' and not hostage to any one particular site/server.

In a nutshell, our plan was to take an off-the-shelf content management system, and only add the missing part (evaluation of results on the server). But other solutions (such as the platform on which the mloss.org site is built) may work well too.

Cheng Soon Ong (on June 24, 2009, 21:41:48)

As Patrik mentioned in the previous post, our original proposal (myself with Soeren and Mikio) is heavily biased by our experience with building mloss.org. There are several common themes in both proposals, with the differences mainly coming from the assumed underlying infrastructure.

We believe that for the new data repository to remain fresh, it has to be community driven. That is, all items (data, tasks, solutions) in the repository should be contributed by the community. Anyway, let us know what you think.

Soeren Sonnenburg (on June 25, 2009, 00:13:30)

I think the most important thing to have is a mere collection of datasets. We should really try to aggregate as many as possible from all the many different places and then release an initial version to let the site evolve in a community driven way.

In a second step I would really love to see a challenge-like evaluation system (like the one I wrote for the large scale learning challenge). We can provide a variety of scalar performance measures for that and accept submissions of patches for new ones...

Many other things are possible but I would first want to focus on these two points.

Patrik Hoyer (on June 25, 2009, 08:27:44)

Regarding Soeren's comment: I think "merely" having a collection of datasets would be useful, but would not fundamentally change the way machine learning methods are evaluated and compared. It would only make it somewhat easier for us to find the type of datasets which are already out there. Thus, to me it seems the crucial part really is the exact problem/task definitions: in essence 'permanent' challenges/benchmarks. If I had to choose between one or the other, I would definitely pick having a repository of such problem definitions/benchmarks over having a repository of datasets. Fortunately, of course, we can try to construct a repository containing both. One could of course focus on one of the two first, then the other (for example, first gathering a large set of data, then gathering machine learning benchmarks using those data) but I don't see any obvious benefit to doing it this way. In fact, don't a lot of current and existing challenges (such as those sponsored by Pascal) already contain an element of both (i.e. there is some new data, and a specific task to apply to it)? Thus it seems simplest to work with the two aspects jointly, no?

I completely agree on the comment regarding the evaluation system: The server should accept "result files" provided by the researchers and score these according to whatever the challenge/benchmark is (and provide a list of current results). One perhaps somewhat controversial (?) thought is that I think it is not that important to always withhold "test" data. If there is enough data, enough tasks, and if researchers can have a look at each other's code and methods, I don't think overfitting is all that serious a problem. That is, you know there is something sneaky going on when your colleague's code begins with "set.random.number.generator.seed(7826715)".

Soeren Sonnenburg (on June 25, 2009, 12:32:10)

To me a data set collection already contains everything you describe above except the evaluation part (e.g. problem/task definition). And I want both too. So I don't see any contradiction here - which is good.

I guess the reason why we have a misunderstanding here is that I am thinking more in terms of milestones here and the "mere" collection is simply the first step. It sounds a bit like you think that this problem is more or less solved already but I don't think so - it is a lot of work even when starting with a full wiki or the mloss.org code. And it is a big contribution already as it is really hard currently to find suitable datasets. A central repository that categorizes things etc will already help a lot here.

To me the only parts missing to get evaluations working are

definition of evaluation score and meaning
upload of outputs to compute scores on the server
display of results associated with dataset/method

And when we have the repository ready, we and the community can create new datasets while we develop the evaluation system.

Cheng Soon Ong (on June 25, 2009, 14:15:46)

Chris Drummond disagrees with our aim of having such rigor for evaluating machine learning methods. His view is explained in the following paper which says that replication is not something worth aiming for. I tried to convince him otherwise...

About Soeren's point of aggregating datasets, I think that in addition to just aggregating right at the start, we should do this in a continuous fashion. For example, UCI gets new datasets every now and again, and this update should be reflected in our repository too. This "slurping" should not be too hard to achieve, based on what we are currently doing with CRAN. For static data repositories like the Large Scale Challenge, of course, we just grab the data once.

I'd like to move the discussion forward a bit by proposing something. Let us not get bogged down with nitpicking the definitions. Is the following structure (based on the Helsinki proposal) ok?

Data Data available in standard formats (Containers). Well defined API for access (Semantics).
Task Formal description of input-output relationships. Method for evaluating predictions.
Solution Methods for feature construction. Protocol for model selection.

If it is ok, then we can drill down and define the internals.

Soeren Sonnenburg (on June 25, 2009, 14:34:39)

We will have to see if it is easily possible to automagically fetch new datasets from UCI or other repositories - but it is a good idea :)

Regarding the structure, I am fine with Data/Tasks but for Solution I would have expected some ML method. I am not sure about where model selection belongs (part of the ML method?) either.

Patrik Hoyer (on June 26, 2009, 13:20:15)

I think Chris Drummond makes a good point: What he calls replicability (i.e. someone else re-running the full code to perfectly re-create the exact figures in a published paper) is not sufficient for the machine learning community to endorse any one method or idea. Rather, a method needs to be tested on a number of datasets, and a theoretical idea applied in many different contexts, for it to become part of the accepted theory and practice in the field. However, that does not mean that 'replicability' is a bad thing! On the contrary, I would think it would facilitate 'reproducing' results (using Drummond's terminology), as it would be much easier for researchers to apply each other's methods to new data and in new contexts.

Furthermore, I believe one of the main benefits of fully open, 'replicable' results is one of transparency. Sure, it is possible to write completely illegible code, effectively hiding all of the critical steps in the method; but most researchers take pride in, and are very careful about the details of the work that they do, particularly so regarding papers, notes, and comments submitted to public forums. If code was public as well, there would be a stronger incentive for researchers to be just as meticulous about the code they write, which in my view can only improve the current state of the field of machine learning.

Regarding datasets vs benchmarks (continuing our thread with Soeren), I think we largely agree; both are needed. My point is mainly that I think we ought to think about and collect both simultaneously, because they often go hand-in-hand. Thus, first focusing on one and then the other is not in my mind the best way to proceed.

Finally, on the data/task/solution division, I agree. For me, a solution consists of whatever is needed (besides the data) to replicate (in Drummond's terms) the exact results on a given task. This means a full description of what software was used, on what platform, with what parameters; any preprocessing that was done to the data, etc etc: Everything needed to exactly produce any given results. Most easily and clearly, this is given by computer code, leaving only a description of what software has to be installed for it to run. Note that the top level function of the computer code should not accept any parameters!

Cheng Soon Ong (on June 26, 2009, 18:31:22)

Just to drill down some more, here is an example of what I meant by data, task and solution (agreeing with Patrik).

Data

MNIST data.
A set of images with corresponding labels.

Task

Multiclass classification with $k=10$.
F1-score or TP1FP

Solution

Multiclass multiple kernel learning, with Gaussian widths [0.1, 0.5, ...]
5 fold CV on 70% of the data.
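One way to read "5 fold CV on 70% of the data" is to hold out the remaining 30% for testing and cross-validate on the rest. The sketch below assumes that reading; the function name `split_with_folds`, the seed, and the round-robin fold assignment are illustrative choices, and the multiple kernel learning step itself is not shown.

```python
import random


def split_with_folds(n_examples, train_fraction=0.7, n_folds=5, seed=0):
    """Return (folds, test_indices) derived from one seeded permutation."""
    rng = random.Random(seed)
    permutation = list(range(n_examples))
    rng.shuffle(permutation)
    n_train = int(round(train_fraction * n_examples))
    train, test = permutation[:n_train], permutation[n_train:]
    # Deal the training indices round-robin into n_folds validation folds.
    folds = [train[i::n_folds] for i in range(n_folds)]
    return folds, test


folds, test = split_with_folds(100)
```

Because the permutation depends only on the seed, storing the seed (or the index lists themselves) alongside the task is enough for someone else to recover the same split.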

Here are also some more notes I had from my ICML Workshop talk.

Data Containers

For static data, there are several common formats: CSV, ARFF, netCDF, HDF5, ODBC
An implementation of ARFF readers and writers in various languages is available here
Should also store data permutations.

Data Semantics

Descriptions of sources and usage scenarios.
Metadata such as the number of examples, or the type of each feature.
Task related information such as which column contains the label, and whether to hold out labels.
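As a rough illustration of the metadata listed above, the following sketch derives the number of examples and a coarse type for each feature from CSV text. The function names, the two type labels, and the sample data are assumptions made for the example, not part of any proposed format.

```python
import csv
import io


def infer_type(values):
    """Label a column 'numeric' if every value parses as a float."""
    try:
        for v in values:
            float(v)
        return "numeric"
    except ValueError:
        return "categorical"


def describe(csv_text, label_column):
    """Collect simple metadata for a CSV dataset with a header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    columns = reader.fieldnames or []
    return {
        "n_examples": len(rows),
        "feature_types": {
            c: infer_type([r[c] for r in rows])
            for c in columns
            if c != label_column
        },
        "label_column": label_column,
    }


# Made-up sample data, for illustration only.
sample = "width,colour,label\n1.5,red,0\n2.0,blue,1\n"
metadata = describe(sample, label_column="label")
```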

Task

Usually defined with an example dataset and solution.
Challenge organisers may determine which datasets are of interest.
There may be many Tasks associated with a particular dataset.

Solutions

Assume (!?) that software is at mloss.org
Data preprocessing (feature construction).
Classifier architectures, e.g. hierarchy of SVMs.

Most importantly, it should be easy for users to contribute data, tasks and solutions to the repository.

Soeren Sonnenburg (on June 28, 2009, 14:40:22)

Apart from the requirement that the software has to be at mloss.org, I am fine with that structure :)

Patrik Hoyer (on June 29, 2009, 10:00:52)

To the previous concrete example, I would add the following (perhaps these were implied, but better to be explicit to really make it clear):

Data: The MNIST data (a set of images with corresponding labels), in some more or less standard format. A free-form (plain English) description of the data: what format it is in, how it was obtained and for what purpose, with links to the original sources, papers which have introduced and described it, etc. (Obviously, this free-form information would be collected collaboratively as time goes by, users adding relevant information as they see fit. No need for it to be perfect when the data is first uploaded/added...)

Task: Free-form description of the task: Multiclass classification with $k=10$, F1-score and TP1FP. Definition of data permutations/division into train/test, perhaps easiest to implement as a script which loads the data and outputs it in a form suitable for direct input to a classifier. Users may contribute scripts for different languages (R, matlab, python, ...). Exact specification of the file format of the "result file", for example: labels (0-9) for the test data, separated by commas, no spaces. Script on server for returning the F1-score and TP1FP given that user uploaded a correctly formatted "result file".

Solution: Code for generating the "result file" from whatever input is provided by the task definition/data, probably as a link to some specific version of some specific software package (for example at mloss) and additional uploaded scripts that perform any additional preprocessing or feature extraction and set parameters etc. Free-form description of the solution (in this example, multiclass multiple kernel learning, gaussian widths selected by 5-fold cross validation on the training data). Numerical scores obtained on the given tasks that this solution solves.
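The server-side scoring of a "result file" described under Task could look roughly like the sketch below. It assumes the comma-separated 0-9 format given as the example, and it assumes the F1-score is macro-averaged over the classes that occur, which the discussion does not specify. TP1FP is left out because it is not defined here, and the function names are hypothetical.

```python
def parse_result_file(text, n_expected):
    """Parse labels 0-9, separated by commas, no spaces, into a list of ints."""
    fields = text.strip().split(",")
    if len(fields) != n_expected:
        raise ValueError(f"expected {n_expected} labels, got {len(fields)}")
    labels = []
    for field in fields:
        if len(field) != 1 or field not in "0123456789":
            raise ValueError(f"invalid label: {field!r}")
        labels.append(int(field))
    return labels


def macro_f1(true_labels, predicted, n_classes=10):
    """Average the per-class F1 over classes seen in truth or predictions."""
    pairs = list(zip(true_labels, predicted))
    scores = []
    for c in range(n_classes):
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        denom = 2 * tp + fp + fn
        # Classes absent from both truth and predictions are skipped.
        if denom:
            scores.append(2 * tp / denom)
    return sum(scores) / len(scores)


# Toy example: four test labels held on the server, one uploaded result file.
truth = [0, 1, 2, 2]
uploaded = parse_result_file("0,1,2,1\n", n_expected=len(truth))
score = macro_f1(truth, uploaded)
```

Rejecting malformed files before scoring keeps the list of current results free of entries that only parsed by accident.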

Does this sound reasonable to everyone? I completely agree that it is important that it is easy to add material to the repository. From my point of view, this is an argument for not-too-strict structure requirements (free-form or relatively adaptive structures) but this then means that it must be possible for others to improve on the material further (i.e. not locking the material so that only the original submitter can change it).

I agree with Soeren that we probably cannot require that all software has to be at mloss. For instance, Matlab can probably not be put on mloss :) and there is certainly going to be other more "intermediate" cases where some commercial software is utilized in the solutions. I think we would be cutting out a huge part of the potential users if we made a strict requirement that everything has to be open-source/free.

Patrik Hoyer (on June 29, 2009, 10:03:19)

The bulleted lists in my previous post (under data, task, solution) did not quite turn out right, I guess I should study the Markdown syntax before any further posts :) Hopefully it is still understandable.

Cheng Soon Ong (on June 30, 2009, 14:27:20)

It sounds like we are converging to something with respect to the design of the internals. So, how about having the following fields for each of the objects? Only those labeled by (required) are necessary, the rest can be included when someone has time. We may even wish to provide at a later date, auto detection for some of the fields.

The design below seems a bit too detailed now. Is there anything other than these 3 objects that needs discussion? Otherwise, I think we can go ahead with the proposal.

Common to all

papers
external urls
free form description
version
backlinks?

Data

container format
source url
measurement details
usage scenario
links to task
links to solution

Task

input format
output format
performance measure
links to data
links to solution

Solution

numerical scores for given data and task
computational pipeline
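To make the field lists above concrete, here is one possible sketch of them as Python dataclasses. It is only an illustration: the class layout, the field types, the use of plain names as links, and the sample values are all assumptions, and the tentative "backlinks?" field is omitted.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Item:
    """Fields listed under 'Common to all'."""
    description: str = ""
    version: int = 1
    papers: List[str] = field(default_factory=list)
    external_urls: List[str] = field(default_factory=list)


@dataclass
class Data(Item):
    container_format: str = ""
    source_url: str = ""
    measurement_details: str = ""
    usage_scenario: str = ""
    tasks: List[str] = field(default_factory=list)      # links to task
    solutions: List[str] = field(default_factory=list)  # links to solution


@dataclass
class Task(Item):
    input_format: str = ""
    output_format: str = ""
    performance_measure: str = ""
    data: List[str] = field(default_factory=list)       # links to data
    solutions: List[str] = field(default_factory=list)  # links to solution


@dataclass
class Solution(Item):
    computational_pipeline: str = ""
    # Scores keyed by (data name, task name); the key shape is an assumption.
    scores: Dict[Tuple[str, str], float] = field(default_factory=dict)


# Illustrative values only; the container format is picked arbitrarily.
mnist = Data(description="MNIST data", container_format="CSV")
digits = Task(performance_measure="F1-score", data=["mnist"])
mnist.tasks.append("digits")
```

Giving every field a default matches the idea that only a few fields are required and the rest can be filled in later.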

Patrik Hoyer (on June 30, 2009, 14:54:19)

Cheng, it seems the (required) -labels were left out of your list. Otherwise I think it seems good.

As for other things needing discussion, what do you (Cheng, Soeren, and others) think of each of the 5 ideas given in the list in my earlier post (June 24, 2009, 12:19:49)? Arguments for/against?

Cheng Soon Ong (on June 30, 2009, 15:12:21)

I think we can discuss exactly which are the minimum required fields later.

The 5 ideas:

Challenges/tasks are precisely specified
The evaluation of the results
Community-driven
Full version history of all the material
All content on the system easily downloadable

My thoughts:

Given the structure above, tasks should be precisely specified. I think for the first prototype, we should leave out challenge creation.
It is not clear to me how to achieve this. At best, we should have scripts on the server in most of the major languages to compare "results" to "labels".
Yes.
Yes.
Yes.

Cheng Soon Ong (on July 5, 2009, 10:16:03)

There are two recent blog posts on open science in general...

http://freelancingscience.com/2009/07/02/open-science-a-step-towards-open-innovation/
http://hmrx.posterous.com/calling-all-non-scientists

Chris Drummond (on July 10, 2009, 22:38:05)

I had promised Soeren that I would post a short version of my argument so we could debate it in this forum. As Cheng Soon kindly points out there is a longer version available.

One compelling argument for repositories such as mloss is reproducibility. Reproducibility of experimental results is seen as a hallmark of science. Collecting all the artifacts used in the production of the experimental results reported in a paper would, it is claimed, guarantee reproducibility. Although not explicitly stated, the subtext is that if we have any pretensions of being scientists then we have little choice but to do this.

My counter argument is that this view is based on a misunderstanding of reproducibility in science. What mloss will allow people to do is replicate experiments, but this is not reproducibility. Reproducibility requires changes; replicability avoids them. Reproducibility's power comes from the differences between an original experiment and its reproduction. The greater the difference the greater the power. One important role of an experiment is to support a scientific hypothesis. The greater the difference of any subsequent experiment the more additional support garnered. Simply replicating an experiment would add nothing, except perhaps to confirm that the original was carried out as reported. To me, this is more of a policing exercise than a scientific one, and therefore, I would claim, of much reduced merit.

Cheng Soon Ong (on July 12, 2009, 21:55:07)

Here are a few more links which may be useful for our work on the data repository:

Open Knowledge Definition, which follows closely in spirit to the Open Source Definition, but applies to data and knowledge.
Peter Skomoroch has a nice collection of links to datasets. Not all are open and/or suitable for machine learning, but it would be a useful source nevertheless.
If we indeed choose to follow a Django based infrastructure and would like to have version control, there is a package called django-reversion which seems to do the trick.

Marcin Wojnarski (on August 21, 2009, 15:55:57)

Hi All,

I've read your discussion with real pleasure. It touches the hard problems of machine learning which make research very difficult (as I experienced myself) and suppress the advance of ML: lack of reproducibility of experiments, low interpretability of results, weak collaboration. There is a huge amount of research being done, half of contemporary informatics is somehow related to ML, there's immense demand from industry... But my feeling is that no one really knows which methods are better for a given task - there are so many competing algorithms, with so many different variants each, while the standards and tools for their evaluation and comparison are lacking. Consequently, it's easier to come up with a brand new algorithm than to precisely evaluate an existing one.

Recently we created a system, named TunedIT (http://tunedit.org/), which tries to address these problems. TunedIT might be interesting for you, as it employs many of the ideas mentioned above: sharing of algorithms, datasets and methods of evaluation, downloadable contents, descriptions of resources, it's community-driven etc. One thing that didn't appear in the discussion but is very important for reproducibility and interpretability of results is the automated evaluation of algorithms. In TunedIT there's an application for this purpose, TunedTester, which can be used by all registered users to run automated tests of algorithms and optionally submit results to the TunedIT database.

I hope TunedIT would be useful in your research. I invite you to try it and to contribute new algorithms, datasets and results. If you find TunedIT interesting, please help us improve it by sharing your comments.

Cheng Soon Ong (on August 31, 2009, 14:43:01)

Hi Marcin,

I particularly like the idea of TunedTester, which allows the user to evaluate algorithms on particular datasets using their local machine. I may have missed it, but it would be nice if the user could also browse the source of TunedTester, since it would explicitly show how it works. (Of course, this shows my bias for OSS). An advantage for you would be that perhaps some user would provide implementations in other languages, although replicating the java sandbox may not be that easy.

One thing which we have discussed in this forum is to develop a system that makes it easy for users to upload their own methods, data, evaluations. Do you have any thoughts about this? As far as I can tell, a user cannot evaluate a method that is not currently listed in TunedIT. Would it not be nice for you to have the machine learning community helping you increase your list of available algorithms, datasets, evaluations, etc?

One other minor note. You may wish to guide the new user a bit about the available algorithms. The blank text boxes assume that the user knows the exact name that you use, which can be a bit frustrating for the newbie.

Keep up the great work!

Marcin Wojnarski (on September 1, 2009, 13:39:29)

Cheng, thanks a lot for comments.

You're right that the help of the community is indispensable for building a rich and useful repository. This is why TunedIT does allow users to upload new resources: algorithms, datasets, evaluations etc. - the user must be registered and logged in to do this. Upon registration, the user is given a home folder in Repository, where he can upload new files. Afterwards, the resources can be used in TunedTester in the same way as preexisting ones, by giving their full name (access path) in Repository.

As to TunedTester, sources are not available due to security reasons - the primary concern in the design of TunedIT was how to assure that results collected in Knowledge Base are valid and trustworthy, even if the users who submitted them cannot be trusted. If we released sources of TT then everyone could modify its behavior and disturb the results.

It's worth adding that TT does not calculate test results by itself. Rather, it's the evaluation procedure which is responsible for all the details of evaluation setup and calculation of a quality measure. Source code of existing evaluation procedures created by us - ClassificationTT70 and RegressionTT70 - is available in Repository, so everyone can check how the test is executed and what is actually calculated at the end. Evaluation procedures are downloaded by TT from Repository and are "pluggable" - new ones can be implemented by any user.

As to blank boxes, we'll try to make it easier. Surely, the user may feel disoriented right now.
