Open Thoughts

Replicability is not Reproducibility: Nor is it Good Science

Posted by Chris Drummond on July 13, 2009

I had promised Soeren that I would post a short version of my argument so we could debate it in this forum. As Cheng Soon kindly points out there is a longer version available.

One compelling argument for repositories such as mloss is reproducibility. Reproducibility of experimental results is seen as a hallmark of science. Collecting all the artifacts used in producing the experimental results reported in a paper would, it is claimed, guarantee reproducibility. Although not explicitly stated, the subtext is that if we have any pretensions of being scientists then we have little choice but to do this.

My counter argument is that this view is based on a misunderstanding of reproducibility in science. What mloss will allow people to do is replicate experiments, but this is not reproducibility. Reproducibility requires changes; replicability avoids them. Reproducibility's power comes from the differences between an original experiment and its reproduction: the greater the difference, the greater the power. One important role of an experiment is to support a scientific hypothesis. The greater the difference of any subsequent experiment, the more additional support is garnered. Simply replicating an experiment would add nothing, except perhaps to confirm that the original was carried out as reported. To me, this is more of a policing exercise than a scientific one, and therefore, I would claim, of much reduced merit.

Comments

Soeren Sonnenburg (on July 14, 2009, 00:03:56)

You might be surprised to hear that I very often had a hard time simply replicating results. If mloss.org provides just that, I would be happy, even though there are other goals...

Tom Fawcett (on July 15, 2009, 07:24:35)

Chris, I read your paper and I still don't think I understand what you're talking about re ML. With all due respect, you seem to talk around the problem philosophically without ever describing clearly what the problem is, especially WRT ML. I haven't read the Sonnenburg et al. paper and I wasn't at the workshops, so maybe everyone else already knows what you're talking about.

What I don't understand yet is:

  • What is an experiment in ML? (This is not a trivial question; I can imagine several very different answers.)

  • What does it mean to replicate an ML experiment? Same software, same data, same parameters, same machines?

  • What does it mean to reproduce the ML experiment? How is this fundamentally different from replicating it? How much should need to be changed, and why, and how does this affect the scientific value?

  • When researchers complained that experiments were not replicable, what were they complaining about? (no software? no data? parameter settings unspecified? or that they performed the experiments but got different results than reported?)

Even a single example of this would help.

Chris Drummond (on July 17, 2009, 20:59:26)

Tom, let me take the fairly recent example of an idea in machine learning, semi-supervised learning. The idea is that unlabeled examples can improve the performance of classification algorithms. I am not sure who would be considered the originator of this idea, but early experiments demonstrated that, at least in some circumstances, unlabeled examples were useful.

So let me try to address your questions.

What is an experiment in ML? I agree it is not a trivial question and there are many answers to it. In this example, its purpose is to support the contention that "unlabeled examples can improve the performance of classification algorithms". I am not sure what is historically the case, but we might take various data sets and a single algorithm, and show how these additional unlabeled examples help under some performance measure.

What does it mean to replicate an ML experiment? The ideal, it seems to me, is to do exactly the same experiment. Same software, same data, same parameters, same machines? Yes, if possible, although given virtual machines I think most people would not insist on the same physical machine.

What does it mean to reproduce the ML experiment? At risk of appearing pedantic let me rephrase your question to "What does it mean to reproduce the ML experimental result?" If the result is to support the idea that "unlabeled examples can improve the performance of classification algorithms", then replication would add no additional support. What we should do instead is to use "different software, different data, different parameters, different machines". In fact, I would claim the greater the differences, the more additional support gained. Using many more algorithms and many more data sets would be additional evidence for the claim that "unlabeled examples can improve the performance of classification algorithms".
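The protocol sketched above can be made concrete. Below is a minimal, stdlib-only toy: a small labeled set trains a baseline classifier, then a self-training loop pseudo-labels an unlabeled pool and retrains, and both models are scored on a held-out test set. Everything here (the synthetic two-cluster data, the nearest-centroid classifier, the self-training scheme) is my own illustrative construction, not the setup of any actual published experiment; a real study would, as argued above, vary the algorithm, data, and parameters.

```python
import random

random.seed(0)

def make_data(n, cls):
    # Toy data: two 2-D Gaussian clusters centred at (-1,-1) and (+1,+1).
    cx = 1.0 if cls == 1 else -1.0
    return [((random.gauss(cx, 1.0), random.gauss(cx, 1.0)), cls) for _ in range(n)]

def centroid(points):
    return (sum(p[0] for p in points) / len(points),
            sum(p[1] for p in points) / len(points))

def fit(labelled):
    # Nearest-centroid classifier: one centroid per class.
    return {c: centroid([x for x, y in labelled if y == c]) for c in (0, 1)}

def predict(model, x):
    return min(model, key=lambda c: (x[0] - model[c][0]) ** 2 +
                                    (x[1] - model[c][1]) ** 2)

def accuracy(model, data):
    return sum(predict(model, x) == y for x, y in data) / len(data)

# Small labelled set, larger unlabelled pool, held-out test set.
labelled   = make_data(5, 0) + make_data(5, 1)
unlabelled = [x for x, _ in make_data(100, 0) + make_data(100, 1)]
test       = make_data(500, 0) + make_data(500, 1)

# Baseline: labelled examples only.
baseline = fit(labelled)

# Self-training: pseudo-label the pool with the current model, retrain.
model = baseline
for _ in range(5):
    pseudo = [(x, predict(model, x)) for x in unlabelled]
    model = fit(labelled + pseudo)

print(accuracy(baseline, test), accuracy(model, test))
```

The comparison of the two printed accuracies is the experimental result; reproducing it in the sense argued for above would mean rerunning with a different classifier, different data, and different parameters, not re-executing this exact script.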

When researchers complained that experiments were not replicable, what were they complaining about? They got different results than reported.

So, although I accept that there is a problem with replication, I am far from convinced it is worth putting a lot of effort into addressing it. What I certainly feel we should avoid is requiring that papers be accepted at certain venues only if their experiments are replicable. I think this would increase the reviewing load, lead to less interesting papers, and generally be counter-productive to our field of research.

Leave a comment

You must be logged in to post comments.