Open Thoughts

Open Data: good or bad?

Posted by Cheng Soon Ong on August 19, 2010

Is sharing always good?

We've been thinking about ways to make it easy for machine learners to exchange data and methods. The assumption behind all this is that sharing is good, and we (as researchers funded by taxpayers' money) should be open with our work.

As has been mentioned before, several funding agencies are pushing for open access to the results of research. One recent story highlights the progress that has been made with Alzheimer’s, in part due to data sharing. As fledgling collaborators know, it is really hard to work in large teams of people. This project, amazingly, brings together the National Institutes of Health, the Food and Drug Administration, the drug and medical-imaging industries, universities and nonprofit groups. One of the key things they had to build was a way for all the different project partners to upload their data. In fact, they had two sites, one for clinical data and a second for the imaging data. And (my machine learning heart cheers) they even detail how they do cross-validation. Bottom line? Data sharing has made the project possible.

Would you like your data to be public? Or first private, and only made public after you are "done" with it?

There has been some concern about genome data being uploaded directly to web servers, where it is available to the general public. For example, the Joint Genome Institute puts sequences online, and collaborators get the data at the same time as the general public. So, even if you had the idea to sequence the genome of a particularly interesting organism, someone else might scoop you to the paper if they are faster at analysing the sequences.

I think a middle ground is probably the way to go: in the words of InfoVegan, a GitHub for data.

