Open Thoughts

Nat Torkington on Open Data

Posted by Cheng Soon Ong on March 9, 2010

I recently came across a blog post on O'Reilly Radar about Truly Open Data, which discusses how concepts from open source software can be translated to open data. Basically, apart from just "getting the data out there", we need software tools for managing this data. I summarize Nat Torkington's list of tools below, with some thoughts on how this may apply to machine learning data.

  • diff and patch - Perhaps we need something like md5sum for binary data? Most machine learners don't seem to use "live" data very often, so perhaps we don't need these tools?
  • version control
  • releases - An obvious release point would be upon submission of a paper. One downside of double-blind reviewing, I realized, is that one cannot release new data (or software) upon submission. Some things are just easier to do with some real bits.
  • documentation - Apart from bioinformatics data that I generated myself, I'd be hard pressed to name one dataset (apart from iris) where I know the provenance of the data.
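
As a rough sketch of the md5sum idea from the first bullet: one way to get a coarse "diff" for binary data is to record a checksum per file and compare the records between two versions of a dataset. The function names and the manifest layout (a plain dict of relative path to checksum) are assumptions made for illustration, not something from the O'Reilly Radar post. This only tells you which files changed, not what changed inside them.

```python
import hashlib
from pathlib import Path


def file_checksum(path, algorithm="md5", chunk_size=1 << 20):
    """Return the hex digest of a file, reading in chunks so large files are fine."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(directory, algorithm="md5"):
    """Map each file under `directory` (by relative path) to its checksum."""
    root = Path(directory)
    return {
        p.relative_to(root).as_posix(): file_checksum(p, algorithm)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(old_manifest, new_manifest):
    """Compare two manifests and list added, removed, and modified files."""
    old_names, new_names = set(old_manifest), set(new_manifest)
    return {
        "added": sorted(new_names - old_names),
        "removed": sorted(old_names - new_names),
        "modified": sorted(
            name for name in old_names & new_names
            if old_manifest[name] != new_manifest[name]
        ),
    }
```

Storing the manifest next to a release (for example, at paper submission) would let others check that they have the same bits. Passing `algorithm="sha256"` swaps in a stronger hash if accidental collisions are a concern.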


Yaroslav Halchenko (on March 14, 2010, 01:04:36)

Well, not really something to add, but a reminder of what was previously mentioned in "Data and Code Sharing Roundtable": -- that one seems to be a good approach to data sharing.
