Nat Torkington on Open Data
by Cheng Soon Ong on March 9, 2010 (0 comments)
I recently came across a blog post by Nat Torkington on O'Reilly Radar about Truly Open Data, which talks about how concepts from open source software can be translated to open data. Basically, apart from just "getting the data out there", we need software tools for managing this data. I summarize his list of tools below, with some thoughts on how this may apply to machine learning data.
- diff and patch - Perhaps we need some md5sum for binary data? It seems that most machine learners actually don't use "live" data very often, so perhaps these resources are not needed for us?
- version control
- releases - An obvious release point would be upon submission of a paper. One downside I realized about double blind reviewing is that one cannot release new data (or software) upon submission. Some things are just easier to do with some real bits.
- documentation - Apart from bioinformatics data that I generated myself, I'd be hard pressed to name one dataset (apart from iris) where I know the provenance of the data.
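The md5sum idea from the diff and patch item can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names file_md5 and same_content are made up for this example, and a checksum only tells you whether two binary files differ, not where they differ.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """Report whether two data files are byte-for-byte identical (by digest)."""
    return file_md5(path_a) == file_md5(path_b)
```

Publishing the digest alongside a dataset release would let readers confirm that they are working with the same bits as the paper.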
Daniel Lemire on Open Source Software
by Mikio L. Braun on February 16, 2010 (1 comment)
Daniel Lemire has an interesting blog post on whether open sourcing your software affects your competitiveness as a researcher.
In short, here is his summary:
- Sharing can’t hurt the small fish.
- Sharing your code makes you more convincing.
- Source code helps spread your ideas faster.
- Sharing raises your profile in industry.
- You write better software if you share it.
This is very much in line with why we started the whole initiative in the first place.
MLOSS 2010 - ICML Workshop just accepted
by Soeren Sonnenburg on February 12, 2010 (0 comments)
We are glad to announce that our MLOSS 2010 workshop at this year's ICML conference has been accepted!
We are therefore happily accepting software submissions. The deadline for the submissions is April 10th, 2010. If accepted, you can present your software to the workshop audience, which is a great opportunity to make your piece of software better known to the machine learning community.
Like last time, we will use mloss.org for managing the submissions. You basically just have to register your project with mloss.org and add the tag icml2010 to it. For more information, have a look at the workshop page.
Missing values
by Cheng Soon Ong on February 2, 2010 (5 comments)
We were recently working on a way to represent data efficiently, and came across the problem of missing values. For simple tabular formats with the same type (e.g. all real values), it is convenient to store data as a 2-D array. We are thinking of a Python numpy array, but I'm sure any solution should be language independent. However, very often, datasets contain missing values, which are indicated by some special character, for example by '?' in weka's arff format. Unfortunately, the character '?' is not a real number, which stuffs up the array.
Does anyone have a suggestion on how to deal with this?
Note that I'm not talking about something like missing value imputation, but just the question of how to represent simple tabular data in computer memory. Of course, the question can be generalized such that some features may have different types from others.
This seems like such a common problem that there must be hundreds of solutions out there...
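One possible answer, offered as a sketch rather than a recommendation, is to map the missing marker to NaN and keep a separate boolean mask. The example below uses only the standard library; the names MISSING, parse_row and missing_mask and the sample rows are made up for illustration. With numpy, the same idea corresponds to storing numpy.nan in a float array or using numpy.ma.masked_array.

```python
import math

MISSING = "?"  # the marker used by weka's arff format, as mentioned above


def parse_row(fields):
    """Convert string fields to floats, mapping the missing marker to NaN."""
    return [math.nan if f.strip() == MISSING else float(f) for f in fields]


def missing_mask(rows):
    """Return a table of booleans that is True wherever a value is missing."""
    return [[math.isnan(value) for value in row] for row in rows]


# Hypothetical comma-separated rows with one missing value each.
raw = ["5.1,3.5,?,0.2", "4.9,?,1.4,0.2"]
table = [parse_row(line.split(",")) for line in raw]
mask = missing_mask(table)
```

NaN only works for floating point columns. For integer or categorical features a separate mask is needed, which is one reason the question becomes harder once features have different types.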
Data and Code Sharing Roundtable
by Victoria Stodden, Chris Wiggins on January 26, 2010 (2 comments)
As pointed out by the authors of the mloss position paper [1] in 2007, "reproducibility of experimental results is a cornerstone of science." Just as in machine learning, researchers in many computational fields (or fields in which computation has only recently played a major role) are struggling to reconcile our expectation of reproducibility in science with the reality of ever-growing computational complexity and opacity. [2-12]
In an effort to address these questions from researchers not only from statistical science but from a variety of disciplines, and to discuss possible solutions with representatives from publishing, funding, and legal scholars expert in appropriate licensing for open access, Yale Information Society Project Fellow Victoria Stodden convened a roundtable on the topic on November 21, 2009. Attendees included statistical scientists such as Robert Gentleman (co-developer of R) and David Donoho, among others.
The inspiration for this roundtable was the leadership of the genome research community in establishing the open release of sequence data. Representatives from that community gathered in Bermuda in 1996 to develop a cooperative strategy both for genome decoding and for managing and sharing the resulting data. Their meeting resulted in the "Bermuda Principles" [13] that shaped the ensuing data sharing practices among researchers and ensured rapid data release. In the computational research community more generally the incentives and pressures can differ from those in human genome sequencing; consequently, the roundtable sought to consider the issues in a larger context. A second goal of the workshop was to produce a publishable document discussing reactions to data and code sharing in computational science. We also published short topical thought pieces [14] authored by participants, including by statistical scientists [15-16], raising awareness of the issue of reproducibility in computational science.
The Data and Code Sharing Roundtable adapted the focus of the genomics community to include access to source code as well as data, across the computational sciences. This echoes mloss's call for "the supporting software and data" to be openly distributed through the mloss repository with links to alternatively stored data collections. The Yale roundtable was organized in five parts: framing issues, examining legal barriers and solutions, considering the role of scientific norms and incentives, discussing how technological tools help and hinder sharing, and finally crafting key points for release in a statement. The agenda is available online [17] with links to each session's slide decks.
The first session framed issues across the disparate fields and was moderated by Harvard Astronomy Professor Alyssa Goodman, and featured presentations from Mark Gerstein, the Albert L. Williams Professor of Computational Biology and Bioinformatics at Yale, Randy LeVeque, the Founders Term Professor of Applied Mathematics at the University of Washington, and Alyssa Goodman herself. The second session was moderated by Frank Pasquale, the Loftus Professor of Law at Seton Hall University, and discussed legal barriers to the sharing of research codes and data and presented alternate licensing frameworks to enable sharing. Pat Brown, Professor of Biochemistry at Stanford University, moderated the session on norms and incentives, leading a discussion of publishing models, peer review, and reward structures in the scientific community. The session on computational solutions was moderated by Ian Mitchell, Computer Science Professor at the University of British Columbia, and examined computational solutions (see for example Matt Knepley's slides from that session). The final session summarized findings and recommendations to be drafted into a jointly authored published statement. The organizers are in the process of creating this "position statement," compiled from the discussions at the workshop and from "thought pieces" contributed by attendees.
We invite members of mloss.org to consider contributing such a thought piece, and hope that the open source community within machine learning will find the thought pieces, slides, or position statement useful in promoting distribution of source code as part of the scientific publication process and promoting reproducible computational science more generally.
Sincerely,
Victoria Stodden
Yale Law School, New Haven, CT
Science Commons, Cambridge, MA
http://www.stanford.edu/~vcs
Chris Wiggins
Department of Applied Physics and Applied Mathematics,
Columbia University, New York, NY
http://www.columbia.edu/~chw2
References:
- [1] Sonnenburg et al., "The need for open source software in machine learning" Journal of Machine Learning Research, 8:2443-2466, 2007 http://j.mp/52JaPJ;
- [2] Social science: Gary King, the Albert J. Weatherhead III University Professor at Harvard University, has documented his efforts in the social sciences at his website http://j.mp/4FfCqz. He also runs The Dataverse Network, a repository for social science data and code http://thedata.org;
- [3] Geophysics: Stanford Geophysics Professor Jon Claerbout's efforts in Geoscience: http://j.mp/7ZHNEe;
- [4] Geophysics: University of Texas at Austin Geosciences Professor Sergey Fomel's open source package for reproducible research, Madagascar: http://j.mp/6UipCZ;
- [5] Signal processing: Signal Processing at Ecole Polytechnique Federale de Lausanne, Reproducible Research Repository; including Vandewalle, Patrick and Kovacevic, Jelena and Vetterli, Martin (2009) "Reproducible Research in Signal Processing - What, why, and how" IEEE Signal Processing Magazine, 26 (3). pp. 37-47 (http://j.mp/6Rc5H2);
- [6] Databases: The database community tested replication in SIGMOD 2009 submissions; cf. I. Manolescu, L. Afanasiev, A. Arion, J. Dittrich, S. Manegold, N. Polyzotis, K. Schnaitter, P. Senellart, S. Zoupanos, D. Shasha, et al. "The Repeatability Experiment of SIGMOD 2008" SIGMOD Record, 37(1):39, 2008 http://j.mp/7SWNli;
- [7] Databases: R.V. Nehme. "Black Hole in Database Research" http://j.mp/4QODKd;
- [8] Climate: "Please, show us your code" RealClimate, Rasmus E. Benestad http://j.mp/8bj0CS;
- [9] Economics: B.D. McCullough. "Got replicability? The Journal of Money, Credit and Banking Archive" Econ Journal Watch, 4(3):326-337, 2007 http://j.mp/6otJMx;
- [10] Linguistics: T. Pedersen. "Empiricism is not a matter of faith" Computational Linguistics, 34(3):465-470, 2008. http://j.mp/31CwFH;
- [11] Computational Biology: Jill P. Mesirov. "Accessible Reproducible Research" Science 22 January 2010: Vol. 327. no. 5964, pp. 415 - 416 http://j.mp/54SDTv;
- [12] General sources on reproducibility: http://www.rrplanet.com/ and http://reproducibleresearch.net/;
- [13] "Bermuda Rules: Community Spirit, With Teeth" Science 16 February 2001: Vol. 291. no. 5507, p. 1192 http://j.mp/4TP2BV;
- [14] Thought pieces available via http://j.mp/4EpcMD;
- [15] "Reproducible research and genome scale biology: approaches in Bioconductor" Vincent Carey and Robert Gentleman, http://j.mp/8xlPLR;
- [16] "View Source" Chris Wiggins http://j.mp/89lDC9;
- [17] Agenda for roundtable available via http://j.mp/5MlmUG.
The Open Source Process and Research
by Mikio Braun on January 13, 2010 (0 comments)
(Cross posted on blog.mikiobraun.de)
I think there is more to be learned from the open source software development process than just publishing the code from your papers. So far, we've mostly focused on making the software side more similar to publishing scientific papers, for example, through creating a special open source software track at JMLR.
However, two other aspects of the open source software development process are worth learning from:
- "Release early, release often": Open source software is not only about making your software available for others to reuse, but it is also about getting in touch with potential users as early as possible, as closely as possible.
Contrast this with the typical publication process in science, where months lie between your first idea, the submission of the paper, its publication, and the reactions through follow-up and response papers.
- Self-organized collaboration: One nice thing about open source software is that you can often find an already sufficiently good solution for some part of your problem. This allows you to focus on the part which is really new. If existing solutions look sufficiently mature and their projects healthy, you might even end up relying on others for part of your project, which is really interesting given that you don't even know these people and have never talked to them. But if the project is healthy, there is a good chance that they will do their best to help you out, because they want to have users for their own project.
Again, contrast this with how you usually work in science, where it's much more common to collaborate with people from your group or people within the same project only. Even if there were someone working on something which would be immensely useful for you, you wouldn't know till months later when their work is finally published. The effect is that there is lots of duplicate work, research results from different groups don't usually interact easily, and much potential for collaboration and synergy is wasted.
While there are certainly reasons why these two areas are different, I think there are ways to make research more interactive and open. And while probably most people aren't willing to switch to open notebook science, I think there are a few things which you can try out now:
- Communicate with people through your blog, or by Twitter or Facebook, and let them know what you're working on, even before you have polished and published it. And if you don't feel comfortable disclosing everything, how about some preliminary plots or performance numbers? To see how others are using social networks to communicate about their research, have a look at the machine learning twibe, or my (entirely non-authoritative) list of machine learning twitterers, or lists of machine learning people others have compiled, or another list of machine learning related blogs.
- Release your software as early as possible, and make use of available infrastructure like blogs, mailing lists, issue trackers, or wikis. There are almost infinitely many ways to go about this, either using a site like github, sourceforge, kenai, launchpad, or savannah, or by setting up a private repository, for example using trac (http://trac.edgewall.org/) or just a bare subversion repository. It doesn't have to be that complicated, though. You can even just put a git repository on your static homepage and have people pull from there. And of course, register your project with mloss, so that others can find it and stay up to date on releases.
- Turn your research project into a software project to create something others can readily reuse. This means making your software usable for others, interfacing it with existing software, and reusing existing software yourself. It doesn't have to be large if it's useful. Have a look at mloss for a huge list of already existing machine learning related software projects.
MLOSS ICML 2010 workshop?
by Soeren Sonnenburg on December 16, 2009 (2 comments)
We are thinking of organizing an ICML 2010 workshop on machine learning open source software. Does anyone here think this is a great idea like we do? If you would like to see this happen, please contact us and help us organize it.
Thanks!
US open access policy
by Cheng Soon Ong on December 14, 2009 (0 comments)
The Office of Science and Technology Policy of the United States of America is having a public consultation on Public Access Policy, which will run till 7 January 2010. The first part (10-20 December 2009) considers implementation issues, in particular:
- Who should enact public access policies?
- How should a public access policy be designed?
The next two sections are (details here):
- Features and Technology (Dec. 21 to Dec 31)
- Management (Jan. 1 to Jan. 7)
If you care about how your research is being published, head over and give your views.
Documentation is hard to do
by Cheng Soon Ong on December 4, 2009 (0 comments)
There was an article at TechNewsWorld yesterday about the poor state of documentation in Linux. It seems that for most projects, there are two kinds of people: the users and the developers. Users always complain that the documentation is not good enough, and developers don't see the point of writing it. Funnily, once some tech savvy user starts digging around in the code a bit, he/she one day wakes up and finds that they have crossed the fence, i.e. the project which they initially said was badly documented is now what they are actively contributing to. Even worse, they also often don't write documentation themselves.
The Pragmatic Programmer gives two tips about documentation:
- Treat English as just another programming language
- Build documentation in, don't bolt it on
Then it goes on to distinguish between internal and external documentation. I think that for machine learning, the external part is really important. Very often, the users of machine learning software are not experts in the field, and "just" downloaded the code to see whether they can solve their problem. In fact, very often, the user is not even familiar with the programming language that the project is implemented in. Each language has its own idiosyncrasies, and projects should try to have at least a README file that tells the user how to get things working. This includes basic things like how to compile and the specific command line operations needed to get the paths correct. Even interpreted languages can be tricky. For example, matlab often requires the right set of addpath statements to get things working, and python requires that $PYTHONPATH be set correctly.
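As an illustration of the kind of path setup a README could document, here is a minimal sketch of a Python launcher that adds a project directory to the module search path itself, so that the user does not have to set $PYTHONPATH by hand. The helper name ensure_on_path and the directory name "src" are hypothetical and not taken from any particular project.

```python
import os
import sys


def ensure_on_path(directory):
    """Prepend a directory to sys.path unless it is already listed."""
    directory = os.path.abspath(directory)
    if directory not in sys.path:
        sys.path.insert(0, directory)
    return directory


# Hypothetical layout: the project's modules live in ./src relative to
# the directory from which the user starts the script.
project_dir = ensure_on_path(os.path.join(os.getcwd(), "src"))
```

After this call, import statements can find modules stored in that directory, which has the same effect for this process as adding the directory to $PYTHONPATH before starting python.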
It happens quite often that reviewers of JMLR submissions complain of not being able to "get the code working". Sometimes this is due to a deeper problem, but often it is just because the reviewer is not a user of the programming language of the submission. Now, before you criticize me and ask why I don't choose better reviewers: if you take the intersection of machine learning expertise, programming language and operating system, you often end up with only one group of people, namely the ones that submitted the project.
Matlab(tm) 7.3 file format is actually hdf5 and can be read from other languages like python
by Soeren Sonnenburg on November 19, 2009 (2 comments)
It looks like matlab version 7.3 and later can write out objects in the so-called matlab 7.3 file format. While at first glance it looks like another proprietary format, it in fact seems to be the Hierarchical Data Format version 5, or hdf5 for short.
So you can do all sorts of neat things:
Let's create some matrix in matlab first and save it:

    >> x=[[1,2,3];[4,5,6];[7,8,9]]
    x =
         1     2     3
         4     5     6
         7     8     9
    >> save -v7.3 x.mat x

Let's investigate that file from the shell:

    $ h5ls x.mat
    x                        Dataset {3, 3}
    $ h5dump x.mat
    HDF5 "x.mat" {
    GROUP "/" {
       DATASET "x" {
          DATATYPE H5T_IEEE_F64LE
          DATASPACE SIMPLE { ( 3, 3 ) / ( 3, 3 ) }
          DATA {
          (0,0): 1, 4, 7,
          (1,0): 2, 5, 8,
          (2,0): 3, 6, 9
          }
          ATTRIBUTE "MATLAB_class" {
             DATATYPE H5T_STRING {
                STRSIZE 6;
                STRPAD H5T_STR_NULLTERM;
                CSET H5T_CSET_ASCII;
                CTYPE H5T_C_S1;
             }
             DATASPACE SCALAR
             DATA {
             (0): "double"
             }
          }
       }
    }
    }

And load it from python:

    >>> import h5py
    >>> import numpy
    >>> f = h5py.File('x.mat')
    >>> x=f["x"]
    >>> x
    <HDF5 dataset "x": shape (3, 3), type "<f8">
    >>> numpy.array(x)
    array([[ 1.,  4.,  7.],
           [ 2.,  5.,  8.],
           [ 3.,  6.,  9.]])
So it actually seems to be a good idea to use matlab's 7.3 format for interoperability.
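One thing to watch for: the matrix loaded in python above is the transpose of the one created in matlab. This is consistent with matlab storing arrays in column-major order while hdf5 readers interpret the buffer in row-major order. The sketch below reproduces the effect with plain Python lists; the helper names are made up for illustration and no hdf5 library is involved.

```python
def flatten_column_major(matrix):
    """Flatten a list-of-rows matrix in column-major (matlab-style) order."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    return [matrix[r][c] for c in range(n_cols) for r in range(n_rows)]


def reshape_row_major(flat, n_rows, n_cols):
    """Rebuild a list-of-rows matrix from a flat buffer read in row-major order."""
    return [flat[r * n_cols:(r + 1) * n_cols] for r in range(n_rows)]


def transpose(matrix):
    """Swap rows and columns of a list-of-rows matrix."""
    return [list(column) for column in zip(*matrix)]


x = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]     # the matrix as typed in matlab
stored = flatten_column_major(x)          # the order in which values are stored
as_read = reshape_row_major(stored, 3, 3) # what a row-major reader sees
restored = transpose(as_read)             # transposing recovers the original
```

With numpy, taking numpy.array(x).T on the loaded dataset would play the role of the transpose step.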