Open Thoughts

August 2010 archive

Open Data: good or bad?

August 19, 2010

Is sharing always good?

We've been thinking about ways to make it easy for machine learners to exchange data and methods. The assumption behind all this is that sharing is good, and we (as researchers funded by taxpayers' money) should be open with our work.

As has been mentioned before, several funding agencies are pushing for open access to the results of research. One recent story highlights the progress that has been made with Alzheimer’s, in part due to data sharing. As fledgling collaborators know, it is really hard to work in large teams of people. Amazingly, this project brings together the National Institutes of Health, the Food and Drug Administration, the drug and medical-imaging industries, universities and nonprofit groups. One of the key things they had to build was a way for all the different project partners to upload their data. In fact, they had two sites, one for clinical data and a second for the imaging data. And (my machine learning heart cheers) they even detail how they do cross validation. Bottom line? Data sharing has made the project possible.

Would you like your data to be public? Or first private, and only made public after you are "done" with it?

There has been some concern about genome data being uploaded directly to web servers, available to the general public. For example, the Joint Genome Institute puts sequences online, and collaborators get the data at the same time as the general public. So, even if you had the idea to sequence the genome of a particularly interesting organism, someone else might scoop you to the paper if they are faster at analysing the sequences.

I think a middle ground is probably the way to go: in the words of InfoVegan, a GitHub for data.

Moving to TU-Berlin

August 12, 2010

I have just moved the database and all content from a server running at Max-Planck in Tuebingen to TU Berlin, where two of the developers are currently working. This significantly eases maintainability and re-adds some of the functionality that we previously had but that was disabled due to security concerns. For example, RSS aggregators will work again, as will the CRAN-R integration of their machine learning repository. In addition, the good news is that this server is twice as powerful (more memory, more hard disk space, more CPU power) and has more bandwidth, so that we now also have SSL-secured logins. Stay tuned and please notify me if you notice any glitches that could possibly have been caused by the transition.

We thank Max Planck Tuebingen for 3.5 years of reliable hosting!