Open Thoughts

February 2010 archive

Daniel Lemire on Open Source Software

February 16, 2010

Daniel Lemire has an interesting blog post on whether open sourcing your software affects your competitiveness as a researcher.

In short, here is his summary:

  1. Sharing can’t hurt the small fish.
  2. Sharing your code makes you more convincing.
  3. Source code helps spread your ideas faster.
  4. Sharing raises your profile in industry.
  5. You write better software if you share it.

This is very much in line with why we started the whole initiative in the first place.

MLOSS 2010 - ICML Workshop just accepted

February 12, 2010

We are glad to announce that our MLOSS 2010 workshop at this year's ICML conference has been accepted!

We are therefore happily accepting software submissions. The deadline for submissions is April 10th, 2010. If accepted, you can present your software to the workshop audience, which is a great opportunity to make it better known to the machine learning community.

Like last time, we will use for managing the submissions. You basically just have to register your project with and add the tag icml2010 to it. For more information, have a look at the workshop page.

Missing values

February 2, 2010

We were recently working on a way to represent data efficiently, and came across the problem of missing values. For simple tabular formats where every value has the same type (e.g. all real values), it is convenient to store the data as a 2-D array. We are thinking of a Python numpy array, but any solution should be language independent. However, datasets very often contain missing values, which are indicated by some special character, for example '?' in Weka's ARFF format. Unfortunately, the character '?' is not a real number, so it cannot be stored in a real-valued array.

Does anyone have a suggestion on how to deal with this?

Note that I'm not talking about something like missing value imputation, but just the question of how to represent simple tabular data in computer memory. Of course, the question can be generalized such that some features may have different types from others.
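To make the question concrete, here is a minimal sketch of two representations that numpy itself offers: using NaN as a sentinel inside a float array, and keeping a separate boolean mask via numpy.ma. The sample rows and the helper parse_cell are made up for illustration, and neither option is meant as the definitive answer. Note that the NaN approach only works for floating-point columns, whereas a mask also works for integer data.

```python
import numpy as np

# Hypothetical rows as they might be read from an ARFF-like file,
# where '?' marks a missing value.
raw_rows = [
    ["5.1", "3.5", "?"],
    ["4.9", "?", "1.4"],
    ["6.2", "2.9", "4.3"],
]


def parse_cell(cell):
    """Return the cell as a float, or NaN if it is the missing marker."""
    return np.nan if cell == "?" else float(cell)


# Option 1: NaN as a sentinel. The array stays a plain 2-D float array,
# but every computation has to be NaN-aware.
data = np.array([[parse_cell(c) for c in row] for row in raw_rows], dtype=float)
col_means_nan = np.nanmean(data, axis=0)

# Option 2: a masked array. The values and a boolean "is missing" mask
# are stored side by side, and reductions skip the masked entries.
masked = np.ma.masked_invalid(data)
col_means_masked = masked.mean(axis=0)

print(col_means_nan)
print(col_means_masked)
```

Both variants keep the data as a 2-D array and give the same per-column means here. The difference is where the "missing" information lives: inside the values themselves, or in a second array of the same shape.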

This seems like such a common problem that there must be hundreds of solutions out there...