Open Thoughts

January 2010 archive

Data and Code Sharing Roundtable

January 26, 2010

As pointed out by the authors of the mloss position paper [1] in 2007, "reproducibility of experimental results is a cornerstone of science." Just as in machine learning, researchers in many computational fields (and in fields where computation has only recently come to play a major role) are struggling to reconcile our expectation of reproducibility in science with the reality of ever-growing computational complexity and opacity. [2-12]

In an effort to address these questions with researchers not only from statistical science but from a variety of disciplines, and to discuss possible solutions with publishers, funders, and legal scholars expert in appropriate licensing for open access, Yale Information Society Project Fellow Victoria Stodden convened a roundtable on the topic on November 21, 2009. Attendees included statistical scientists such as Robert Gentleman (co-developer of R) and David Donoho, among others.

The inspiration for this roundtable was the leadership of the genome research community in establishing the open release of sequence data. Representatives from that community gathered in Bermuda in 1996 to develop a cooperative strategy both for genome decoding and for managing and sharing the resulting data. Their meeting resulted in the "Bermuda Principles" [13] that shaped the ensuing data sharing practices among researchers and ensured rapid data release. In the computational research community more generally, the incentives and pressures can differ from those in human genome sequencing; consequently, the roundtable sought to consider the issues in a larger context. A second goal of the workshop was to produce a publishable document discussing reactions to data and code sharing in computational science. We also published short topical thought pieces [14] authored by participants, including statistical scientists [15-16], raising awareness of the issue of reproducibility in computational science.

The Data and Code Sharing Roundtable adapted the focus of the genomics community to include access to source code as well as data, across the computational sciences. This echoes mloss's call for "the supporting software and data" to be openly distributed through the mloss repository with links to alternatively stored data collections. The Yale roundtable was organized in five parts: framing issues, examining legal barriers and solutions, considering the role of scientific norms and incentives, discussing how technological tools help and hinder sharing, and finally crafting key points for release in a statement. The agenda is available online [17] with links to each session's slide decks.

The first session, which framed issues across the disparate fields, was moderated by Harvard Astronomy Professor Alyssa Goodman and featured presentations from Mark Gerstein, the Albert L. Williams Professor of Computational Biology and Bioinformatics at Yale; Randy LeVeque, the Founders Term Professor of Applied Mathematics at the University of Washington; and Alyssa Goodman herself. The second session was moderated by Frank Pasquale, the Loftus Professor of Law at Seton Hall University. It discussed legal barriers to the sharing of research codes and data and presented alternate licensing frameworks to enable sharing. Pat Brown, Professor of Biochemistry at Stanford University, moderated the session on norms and incentives, leading a discussion of publishing models, peer review, and reward structures in the scientific community. The session on computational solutions was moderated by Ian Mitchell, Computer Science Professor at the University of British Columbia (see, for example, Matt Knepley's slides from that session). The final session summarized findings and recommendations to be drafted into a jointly authored published statement. The organizers are in the process of creating this "position statement," compiled from the discussions at the workshop and from "thought pieces" contributed by attendees.

We invite members of the mloss community to consider contributing such a thought piece, and hope that the open source community within machine learning will find the thought pieces, slides, or position statement useful in promoting distribution of source code as part of the scientific publication process and promoting reproducible computational science more generally.


Victoria Stodden
Yale Law School, New Haven, CT
Science Commons, Cambridge, MA

Chris Wiggins
Department of Applied Physics and Applied Mathematics,
Columbia University, New York, NY


  • [1] Sonnenburg et al., "The need for open source software in machine learning" Journal of Machine Learning Research, 8:2443-2466, 2007;
  • [2] Social science: Gary King, the Albert J. Weatherhead III University Professor at Harvard University, has documented his efforts in the social sciences at his website. He also runs The Dataverse Network, a repository for social science data and code;
  • [3] Geophysics: Stanford Geophysics Professor Jon Claerbout's efforts in Geoscience;
  • [4] Geophysics: University of Texas at Austin Geosciences Professor Sergey Fomel's open source package for reproducible research, Madagascar;
  • [5] Signal processing: Signal Processing at Ecole Polytechnique Federale de Lausanne, Reproducible Research Repository; including Vandewalle, Patrick and Kovacevic, Jelena and Vetterli, Martin (2009) "Reproducible Research in Signal Processing - What, why, and how" IEEE Signal Processing Magazine, 26 (3). pp. 37-47;
  • [6] Databases: The database community tested replication in SIGMOD 2009 submissions; cf. I. Manolescu, L. Afanasiev, A. Arion, J. Dittrich, S. Manegold, N. Polyzotis, K. Schnaitter, P. Senellart, S. Zoupanos, D. Shasha, et al. "The Repeatability Experiment of SIGMOD 2008" SIGMOD Record, 37(1):39, 2008;
  • [7] Databases: R.V. Nehme. "Black Hole in Database Research";
  • [8] Climate: "Please, show us your code" RealClimate, Rasmus E. Benestad;
  • [9] Economics: B.D. McCullough. "Got Replicability? The Journal of Money, Credit and Banking Archive" Econ Journal Watch, 4(3):326-337, 2007;
  • [10] Linguistics: T. Pedersen. "Empiricism is not a matter of faith" Computational Linguistics, 34(3):465-470, 2008;
  • [11] Computational Biology: Jill P. Mesirov. "Accessible Reproducible Research" Science 22 January 2010: Vol. 327. no. 5964, pp. 415 - 416;
  • [12] General sources on reproducibility: and;
  • [13] "Bermuda Rules: Community Spirit, With Teeth" Science 16 February 2001: Vol. 291. no. 5507, p. 1192;
  • [14] Thought pieces available via;
  • [15] "Reproducible research and genome scale biology: approaches in Bioconductor" Vincent Carey and Robert Gentleman;
  • [16] "View Source" Chris Wiggins;
  • [17] Agenda for roundtable available via

The Open Source Process and Research

January 13, 2010

(Cross posted on

I think there is more to be learned from the open source software development process than just publishing the code from your papers. So far, we've mostly focused on making the software side more similar to publishing scientific papers, for example, through creating a special open source software track at JMLR.

However, two other aspects of the open source software development process are worth learning from:

  • "Release early, release often" Open source software is not only about making your software available for others to reuse, but it is also about getting in touch with potential users as early as possible, as closely as possible.

Contrast this with the typical publication process in science, where months pass between your first idea, the submission of the paper, its publication, and the reactions through follow-up and response papers.

  • Self-organized collaboration One nice thing about open source software is that you can often find an already sufficiently good solution for some part of your problem. This allows you to focus on the part which is really new. If existing solutions look sufficiently mature and their projects healthy, you might even end up relying on others for part of your project, which is really interesting given that you don't even know these people and have never talked to them. But if the project is healthy, there is a good chance that they will do their best to help you out, because they want to have users for their own project.

Again, contrast this with how you usually work in science, where it's much more common to collaborate with people from your group or people within the same project only. Even if there were someone working on something which would be immensely useful for you, you wouldn't know till months later when their work is finally published. The effect is that there is lots of duplicate work, research results from different groups don't usually interact easily, and much potential for collaboration and synergy is wasted.

While there are certainly reasons why these two areas are different, I think there are ways to make research more interactive and open. And while probably most people aren't willing to switch to open notebook science, I think there are a few things which you can try out now:

  • Communicate with people through your blog, or by Twitter or Facebook, and let them know what you're working on, even before you have polished and published it. And if you don't feel comfortable disclosing everything, how about some preliminary plots or performance numbers? To see how others are using social networks to communicate about their research, have a look at the machine learning twibe, or my (entirely non-authoritative) list of machine learning twitterers, or lists of machine learning people others have compiled, or another list of machine learning related blogs.

  • Release your software as early as possible, and make use of available infrastructure like blogs, mailing lists, issue trackers, or wikis. There are almost infinitely many ways to go about this, either using some site like github, sourceforge, kenai, launchpad, or savannah, or by setting up a private repository, for example using trac, or just a bare subversion repository. It doesn't have to be that complicated, though. You can even just put a git repository on your static homepage and have people pull from there. And of course, register your project with mloss, so that others can find it and stay up to date on releases.

  • Turn your research project into a software project to create something others can readily reuse. This means making your software usable for others, interfacing it with existing software, and also starting to reuse existing software yourself. It doesn't have to be large if it's useful. Have a look at mloss for a huge list of already existing machine learning related software projects.
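
For the simplest option mentioned above, a git repository on a static homepage, here is a minimal sketch of one way to prepare such a repository. It assumes git is installed and that your homepage is served from a directory you can copy files into. The function name and the paths in the comments are made up for illustration.

```python
import subprocess
from pathlib import Path


def export_for_static_hosting(work_repo, export_dir):
    """Create a bare copy of work_repo that a plain web server can serve.

    Both arguments are paths; export_dir must not exist yet.
    """
    export_dir = Path(export_dir)
    # A bare clone holds only the repository data, with no working tree.
    subprocess.run(
        ["git", "clone", "--bare", str(work_repo), str(export_dir)],
        check=True,
    )
    # Write info/refs and objects/info/packs, the index files that
    # clients need when fetching over plain ("dumb") HTTP.
    subprocess.run(
        ["git", "update-server-info"],
        cwd=export_dir,
        check=True,
    )
    return export_dir


# Hypothetical usage:
#   export_for_static_hosting("my-project", "public_html/my-project.git")
```

After copying the exported directory to your web space, others should be able to clone it by pointing `git clone` at its URL. The `git update-server-info` step has to be repeated whenever the exported repository changes, since a static server cannot compute that information on request.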