August 2009 archive
Netflix: part 1
August 10, 2009
As most of you probably know, the Netflix Prize came to an exciting conclusion recently. The official results are not out yet, so it is still unknown which of the top two teams on the leaderboard, The Ensemble or BellKor's Pragmatic Chaos, will win the 1 million dollar prize. The leaderboard shows the results on a public test set, but Netflix will determine the grand prize winner on a secret test set.
Anyway, I emailed the teams to ask whether they used any open source machine learning software in their prize-winning efforts. The feeling I get from the responses is that both teams rolled their own solutions. They were also understandably reluctant to share their methods, both because the official results are not out yet and because Netflix in essence owns the IP.
Greg McAlpin from The Ensemble was kind enough to collect information from his team and provide me with the following summary of open source software that they used. Unfortunately, they also did not want to share their machine learning methods.
Our team decided that it would be best to wait until Netflix officially
announces the winner of the competition before we talk about how we used
any open source software that is related to machine learning.
We used plenty of open source tools though. Different members of the
team used:
JAMA/TNT, Mersenne Twister, Ruby, Perl, Python, R, Linux, gcc (and tool
chain), gsl, tcl, mysql, openmp, CLAPACK, BLAS, all of the CygWin GNU
software
Many members of our team first met on a Drupal website. And personally,
I could never have kept track of everything that was going on without
TiddlyWiki.
I know that this isn't really what you were asking for. Much of the
existing open source software that we were aware of was not able to
handle the size of the Netflix Prize data set. I don't think that
anyone got Weka or even Octave to work with the data. Some excellent
new open source frameworks were created by people competing for the
Netflix Prize. It was interesting to me that code.google.com became the
home for many open source projects (instead of sourceforge).
PLoS to publish software
August 8, 2009
A recent GenomeWeb article said that PLoS may start offering an open source software track in the near future. The new editor will be Robert Murphy, whose lab has published software for image analysis in protein subcellular localization.
Apparently, at the same Biolink SIG at the recent ISMB in Stockholm, they also discussed publishing data. Since we are also thinking about how to distribute data, we will be watching developments at PLoS closely. They also discussed how to make papers more machine readable through semantic markup. The example used looks like it took a lot of effort from the publishers, and I wonder whether it is feasible for journals to do this for all their published papers.