Open Thoughts

December 2011 archive


December 16, 2011

Meeting Cheng at NIPS we had a discussion on how to improve user experience of So I got my hands dirty and fixed a few minor issues on

A long standing feature request was that software that once appeared in JMLR will continue to be tagged published in JMLR and be highlighted. This should now be the case.

In addition, I again did limit the automagic pulling of r-cran packages to now happen once / month only. This should give manually updated software a higher visibility again.

If you have suggestions for improvements and don't mind to code a little python's source code is now availabe on github. In particular, if you'd like to attempt to improve the R-CRAN slurper it's code is here.

Mendeley/PLoS API Binary Battle (winners)

December 6, 2011

The results of the Mendeley/PLoS API Binary Battle are out:



Share your personal genome from 23andMe or deCODEme to find the latest relevant research and let scientists discover new genetic associations.

1st runner up


Continual reviews of papers, even after they are published.

2nd runner up


Something close to my heart, programmatic interface to data!

What is a file?

December 5, 2011

What is a file?

Two bits of news appeared recently:

  • The distribution of file sizes on the internet indicates that the human brain limits the amount of produced data. The article however observes that "it'll be interesting to see how machine intelligence might change this equation. It may be that machines can be designed to distort our relationship with information. If so, then a careful measure of file size distribution could reveal the first signs that intelligent machines are among us!"

  • Paul Allen's Institute has been publishing its data in an open fashion. Ironically, the article is behind a paywall. However, the Allen Institute for Brain Science has a data portal.

I wondered about the distribution of data which is clearly machine generated and in some sense most easily digested by machine as well. It turns out that it is quite difficult to find out how big files are. In some sense, for the brain atlas, the amount of data (>1 petabyte of image data alone) is more than is easily transferable across the internet. Most human users of this data would use some sort of web based visualization of the data, and hence the meaning of the word "file" isn't so obvious. In fact, there has been a recent trend to "hide" the concept of a file. One example is iPhones and iPads where you do not have access to the file system, and hence do not really know whether you are transfering parts of a file or streaming bytes. Another example is Google's AppEngine, where users access data through a database. A third example is Amazon's Silk browser which "renders" a web page in a more efficient fashion using Amazon's infrastructure rather than your local client.

If we take the extreme view that we use some sort of machine learning algorithm to filter the world's data for our consumption, this implies that all the world's data is in one "file", and we are just looking at parts of it. From this point of view, the paper about using file sizes to reveal machine intelligence is not going to work. In fact, thinking about file sizes in the first place is just plain misleading.