Open Thoughts

May 2011 archive

Swamped in R-CRAN updates

May 24, 2011

It seems like the regular updates of packages in R-CRAN are starting to hide the manually updated packages on We are therefore only updating R-CRAN packages once per week (instead of daily as we used to).

I hope this gets your packages increased visibility again.

reclab prize

May 17, 2011

After the success of the Netflix prize, it seems that would also like to entice smart machine learners to solve their recommendation problem. The idea is the same: improve 10% over the baseline to win 1 million dollars.

Details are available at:

A couple of things are different though:

  • There is a 250,000 bonus for your academic institution.
  • The leaders of the Netflix prize were all using ensemble type classifiers (see literature below, and previous post), and it seems like the reclab prize wants to have some diversity by actually having "peer review" to choose the semi-finalists.
  • Instead of having a fixed training and test set, the best algorithms would be run against live traffic.
  • Since software is much smaller than the data, it makes much more sense to move source code to data than vice versa. And competitors must submit source only!
  • You can (kind of) use third party code, as long as it is on Maven. Strange restriction on the type of license really. It may make sense to not allow GPL "contamination", but all the other open source licenses?

You can bias the competition to your favour by nominating your friends as reviewers. ;-)

The Netflix winners

  • Y. Koren, "The BellKor Solution to the Netflix Grand Prize" PDF (2009).
  • A. Töscher, M. Jahrer, R. Bell, "The BigChaos Solution to the Netflix Grand Prize" PDF (2009).
  • M. Piotte, M. Chabbert, "The Pragmatic Theory solution to the Netflix Grand Prize" PDF (2009).

Open Data Challenge

May 13, 2011

20,000 euros to be won at:

Just so that this doesn't sound too much like a scam, this is a competition that is closing soon, and it is being organised by the Open Knowledge Foundation and the Open Forum Academy. There are four categories:

  • Ideas
  • Apps
  • Visualizations
  • Datasets

A problem with reproducible research

May 3, 2011

One weird side effect of open source software and reproducible research is that it would make it much more challenging to set meaningful computational exercises for teaching.

I'm organising a course this semester that looks at various applications of matrix factorization. The students solve various matrix problems throughout the semester, and apply them to solve questions such as compression, collaborative filtering, role based access control and inpainting. The various solutions to the applications are ranked, and students are graded based on their rank in class for this part of the course. At the end of the semester, there is a small project where the students have to do something novel, and write up a short paper about it. We thought about trying to encourage open source submissions to the exercises and projects, but quickly realized that it would raise the bar.

If all students submitted open solutions to their exercises, then it would quickly become a plagiarism checking nightmare for the teaching assistants, since students submitting later would be able to copy earlier solutions. However, requiring each exercise submission to be different from previous ones is also somewhat unfair, as it quickly becomes quite difficult to find new ways to solve an exercise. Just to put things in perspective, exercises are simple things like using singular value decomposition to perform image compression. However, making solutions public has all the benefits that we know and love from open source software. More importantly in a classroom environment, we encourage the students to learn from each other's solutions and to discuss problems amongst themselves.
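To give a feel for how small such an exercise is, here is a minimal sketch of SVD-based image compression in Python with NumPy. It is not the course's reference solution: the function names `compress_svd` and `storage_ratio`, the synthetic 64x64 "image", and the choice of rank 5 are all assumptions made for illustration.

```python
import numpy as np


def compress_svd(image, rank):
    """Return a rank-`rank` approximation of a 2-D array via truncated SVD."""
    # Thin SVD: image = U @ diag(s) @ Vt, singular values sorted descending.
    U, s, Vt = np.linalg.svd(image, full_matrices=False)
    # Keep only the `rank` largest singular values and their vectors.
    return (U[:, :rank] * s[:rank]) @ Vt[:rank, :]


def storage_ratio(shape, rank):
    """Numbers stored for the truncated factors, relative to the full image."""
    m, n = shape
    return rank * (m + n + 1) / (m * n)


# Hypothetical test image: a smooth pattern plus a little noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 64)
image = np.outer(np.sin(4 * x), np.cos(3 * x)) + 0.01 * rng.standard_normal((64, 64))

approx = compress_svd(image, rank=5)
rel_error = np.linalg.norm(image - approx) / np.linalg.norm(image)
print(f"relative error: {rel_error:.4f}, storage ratio: {storage_ratio(image.shape, 5):.3f}")
```

With so little room for variation in the core routine, it is easy to see why two honest submissions can end up looking nearly identical.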

Fine, we thought: "we can make the solutions open after the exercise deadline". This somewhat defeats the last idea of encouraging students to discuss and solve problems together. Since the lectures cover different material by then, the students are less motivated to work on a previous exercise. More subtly, it would make the final project much more challenging. If everything was secret, then all the students had to do for the final project was whip together some "baseline" methods using their exercise submissions, and develop a "novel" method that beats their baseline. Given the short 6 week time frame for the project, we do not expect significant novelty, but something that was not presented in the lecture. However, if all student exercise solutions were open, the novelty level would quickly rise, as the students would now have a baseline of all submitted exercise solutions.

Even if we could figure out a way to time it such that the solutions could not be copied by other submissions, there is still an effect on the following year's course. Since the previous year's solutions would all be available, the new batch of students would need to be "different" from all previous iterations of the course. Of course, some "leaks" happen already, since students get solutions from their seniors, and there are already plenty of publicly available open source solutions out there.

In essence, what we need are courses that are unique each year (in each university), and still have "easy" enough exercises.

I'm ashamed to admit that in the end, in the face of these challenges, we decided that we would keep all submissions secret, and did not push an open source idea for this course.