Open Thoughts

Data and Code Sharing Roundtable

Posted by Victoria Stodden, Chris Wiggins on January 26, 2010

As pointed out by the authors of the mloss position paper [1] in 2007, "reproducibility of experimental results is a cornerstone of science." Just as in machine learning, researchers in many computational fields (or in fields in which computation has only recently come to play a major role) are struggling to reconcile our expectation of reproducibility in science with the reality of ever-growing computational complexity and opacity [2-12].

To address these questions with researchers from statistical science and a variety of other disciplines, and to discuss possible solutions with representatives from publishing and funding bodies as well as legal scholars expert in appropriate licensing for open access, Yale Information Society Project Fellow Victoria Stodden convened a roundtable on the topic on November 21, 2009. Attendees included statistical scientists such as Robert Gentleman (co-developer of R) and David Donoho, among others.

The inspiration for this roundtable was the leadership of the genome research community in establishing the open release of sequence data. Representatives from that community gathered in Bermuda in 1996 to develop a cooperative strategy both for genome decoding and for managing and sharing the resulting data. Their meeting resulted in the "Bermuda Principles" [13] that shaped the ensuing data sharing practices among researchers and ensured rapid data release. In the computational research community more generally the incentives and pressures can differ from those in human genome sequencing; consequently, the roundtable sought to consider the issues in a larger context. A second goal of the workshop was to produce a publishable document discussing reactions to data and code sharing in computational science. We also published short topical thought pieces [14] authored by participants, including by statistical scientists [15-16], raising awareness of the issue of reproducibility in computational science.

The Data and Code Sharing Roundtable adapted the focus of the genomics community to include access to source code as well as data, across the computational sciences. This echoes mloss's call for "the supporting software and data" to be openly distributed through the mloss repository with links to alternatively stored data collections. The Yale roundtable was organized in five parts: framing issues, examining legal barriers and solutions, considering the role of scientific norms and incentives, discussing how technological tools help and hinder sharing, and finally crafting key points for release in a statement. The agenda is available online [17] with links to each session's slide decks.

The first session, moderated by Harvard Astronomy Professor Alyssa Goodman, framed issues across the disparate fields and featured presentations from Mark Gerstein, the Albert L. Williams Professor of Computational Biology and Bioinformatics at Yale, Randy LeVeque, the Founders Term Professor of Applied Mathematics at the University of Washington, and Goodman herself. The second session, moderated by Frank Pasquale, the Loftus Professor of Law at Seton Hall University, discussed legal barriers to the sharing of research code and data and presented alternative licensing frameworks to enable sharing. Pat Brown, Professor of Biochemistry at Stanford University, moderated the session on norms and incentives, leading a discussion of publishing models, peer review, and reward structures in the scientific community. The session on computational solutions, moderated by Ian Mitchell, Computer Science Professor at the University of British Columbia, examined how technological tools can help and hinder sharing (see, for example, Matt Knepley's slides from that session). The final session summarized findings and recommendations to be drafted into a jointly authored published statement. The organizers are in the process of creating this "position statement," compiled from the discussions at the workshop and from "thought pieces" contributed by attendees.

We invite members of mloss.org to consider contributing such a thought piece, and hope that the open source community within machine learning will find the thought pieces, slides, or position statement useful in promoting distribution of source code as part of the scientific publication process and promoting reproducible computational science more generally.

Sincerely,

Victoria Stodden
Yale Law School, New Haven, CT
Science Commons, Cambridge, MA
http://www.stanford.edu/~vcs

Chris Wiggins
Department of Applied Physics and Applied Mathematics,
Columbia University, New York, NY
http://www.columbia.edu/~chw2

References:

  • [1] S. Sonnenburg et al., "The need for open source software in machine learning," Journal of Machine Learning Research, 8:2443-2466, 2007 http://j.mp/52JaPJ;
  • [2] Social science: Gary King, the Albert J. Weatherhead III University Professor at Harvard University, has documented his efforts in the social sciences at his website http://j.mp/4FfCqz. He also runs The Dataverse Network, a repository for social science data and code http://thedata.org;
  • [3] Geophysics: Stanford Geophysics Professor Jon Claerbout's efforts in Geoscience: http://j.mp/7ZHNEe;
  • [4] Geophysics: University of Texas at Austin Geosciences Professor Sergey Fomel's open source package for reproducible research, Madagascar: http://j.mp/6UipCZ;
  • [5] Signal processing: Signal Processing at École Polytechnique Fédérale de Lausanne, Reproducible Research Repository; including Vandewalle, Patrick and Kovacevic, Jelena and Vetterli, Martin (2009) "Reproducible Research in Signal Processing - What, why, and how" IEEE Signal Processing Magazine, 26 (3). pp. 37-47 (http://j.mp/6Rc5H2);
  • [6] Databases: The database community tested replication in SIGMOD 2009 submissions; cf. I. Manolescu, L. Afanasiev, A. Arion, J. Dittrich, S. Manegold, N. Polyzotis, K. Schnaitter, P. Senellart, S. Zoupanos, D. Shasha, et al. "The Repeatability Experiment of SIGMOD 2008" SIGMOD Record, 37(1):39, 2008 http://j.mp/7SWNli;
  • [7] Databases: R.V. Nehme. "Black Hole in Database Research" http://j.mp/4QODKd;
  • [8] Climate: "Please, show us your code" RealClimate, Rasmus E. Benestad http://j.mp/8bj0CS;
  • [9] Economics: B.D. McCullough. "Got replicability? The Journal of Money, Credit and Banking Archive" Econ Journal Watch, 4(3):326-337, 2007 http://j.mp/6otJMx;
  • [10] Linguistics: T. Pedersen. "Empiricism is not a matter of faith" Computational Linguistics, 34(3):465-470, 2008. http://j.mp/31CwFH;
  • [11] Computational Biology: Jill P. Mesirov. "Accessible Reproducible Research" Science 22 January 2010: Vol. 327. no. 5964, pp. 415 - 416 http://j.mp/54SDTv;
  • [12] General sources on reproducibility: http://www.rrplanet.com/ and http://reproducibleresearch.net/;
  • [13] "Bermuda Rules: Community Spirit, With Teeth" Science 16 February 2001: Vol. 291. no. 5507, p. 1192 http://j.mp/4TP2BV;
  • [14] Thought pieces available via http://j.mp/4EpcMD;
  • [15] "Reproducible research and genome scale biology: approaches in Bioconductor" Vincent Carey and Robert Gentleman, http://j.mp/8xlPLR;
  • [16] "View Source" Chris Wiggins http://j.mp/89lDC9;
  • [17] Agenda for roundtable available via http://j.mp/5MlmUG.

Comments

Cheng Soon Ong (on January 27, 2010, 23:33:42)

There is an article with more about the above at [Ars Technica](http://arstechnica.com/science/news/2010/01/keeping-computers-from-ending-sciences-reproducibility.ars).

From the article: "it's time for a major revision of the scientific method."

Open source software should provide a key benefit: more interested parties able to evaluate and improve the code. "Not only will we clearly publish better science, but redesigned and updated code bases will be valuable scientific contributions."

Yaroslav Halchenko (on March 1, 2010, 23:57:45)

Our 'focused review', "Statistical learning analysis in neuroscience: aiming for transparency" by Michael Hanke, Yaroslav O. Halchenko, James V. Haxby and Stefan Pollmann, was recently accepted. In it "We argue that such transparency requires 'neuroscience-aware' technology for the performance of multivariate pattern analyses of neural data that can be documented in a comprehensive, yet comprehensible way", i.e. the source code of the analysis (and we advocate PyMVPA as one possible approach to achieving this goal). Such exposure of complete analyses is really needed in the neuroimaging field, where statistical learning methods are becoming more and more popular and, as you might expect, increasingly provocative "findings" are reported without adequate support to reason about or to reproduce them.

I've made a preprint of the paper available.
