Open Thoughts

August 2008 archive

Software Freedom Law Center on GPL compliance

August 22, 2008

The Software Freedom Law Center has posted a guide on how to ensure that you do not violate the GNU General Public License when using GPL'd software in your project. Ars Technica also has a few comments.

The guide might also come in very handy if your legal department is eager to learn more about the implications of using open source software.

Wuala, social online storage

August 15, 2008

There was a small party last night to celebrate the beta launch of Wuala, the latest in a long line of online storage services. The idea of online storage is compelling: no need to synchronise all your different computers, somebody else takes care of your backup, and it is easy to share data with others. However, the reality of the situation is that there is no free lunch, and for most people, the cost of online storage is prohibitive. There are several free services (for example the list here), but in general, you cannot just upload everything to the cloud and throw away your hard drive.

Wuala lets you store anything -- photos, videos, your latest paper -- for free, with no bandwidth or file size limits. What's the catch? You have to contribute storage, megabyte for megabyte, to the service. You get 1GB free to start with, but for any extra space that you need, you have to plug in your own hard drive and offer it to them to add to the cloud. So, basically you convert your hard drive from a private, one-person device to a shared device with bits of data from everyone. Like GFS, it creates redundant copies of data and distributes them on commodity hardware; in the case of Wuala, the commodity hardware is your hard drive and the data bus is the internet. When users transfer data to and from Wuala, they push and pull it P2P style from all the different hard drives of the other members.

There are two ways to access Wuala: via a web browser and via an application that runs on your computer. The Linux version of the application effectively requires the user to have root access to the machine, since it calls for an fstab entry. So, for Linux users in academic environments with centralized admins, this makes life difficult. The web browser interface uses Java. Their website was a bit slow this morning when I tried it, so be patient with them.

Personally, for storage and backup, I think there are better ways to do it (e.g. buying an external hard drive, cloning my current laptop drive and leaving the external disk with a good friend that I meet regularly). However, if you are sharing data among collaborators, this seems like a wonderful thing to have. Each member of the team contributes some amount of disk space and bandwidth, and Voilà!

Walking the walk

August 14, 2008

We have made the source of mloss.org available at: http://mloss.org/software/view/132/

This site is based on Django, and we have borrowed several components from other open source projects. We hope that by making the source of this site open, we can benefit other communities who also want to build a similar type of site. If you do build a site which lists open source software, and you have some projects which could be of interest to the machine learning community, please let us know. We would love to be able to regularly (automatically) update our site from external sources, as we are currently doing with CRAN (see the earlier blog post).

Also, some personal communication from a disgruntled new user convinced us that we should have our forum more clearly located. So, now we have added a new tab to our navigation bar. Hopefully we will have a more lively forum now that it is not "hidden away".

Finally, one plea to those budding Python programmers out there who believe in the cause: please join the team.

To those wondering where the headline comes from: http://www.wsu.edu/~brians/errors/walk.html

Interoperability and the Curse of Polyglotism

August 12, 2008

It seems that this homepage is steadily growing. We already have a large number of registered projects covering many different applications and machine learning methods. Time to think where we're heading with all of this.

I think one of the first goals of this whole endeavor is that you can easily find software for methods published elsewhere. Irrespective of whether you're interested in comparing your own method against some other method, or you actually want to apply the method to some real data, being able to find and download the software is a huge improvement over having to re-implement the method based on the paper.

However, I think that ultimately it would be great if some form of interoperability between different software packages which address the same problem would evolve. This holds in particular in a field such as machine learning, where the number of (abstract) problems is relatively small and there exist many competing methods for a given problem (like, for example, two-class classification on vectorial data). Being able to replace one of these methods easily with another one would be very useful.

The way to achieve this is, as everywhere else in the industry, to develop standards. Actually, there are many different levels where such standards could be defined, ranging from web-services, over binary APIs, to data file formats.

A few weeks ago, I advocated the use of modern scripting languages like Python or Ruby to develop new machine learning toolboxes, but actually, with respect to interoperability, this "polyglotism" poses some new problems. Back in the "old days" when people were mostly using compiled languages, making your software usable for others was a matter of creating a library which could then be linked against new programs. Differences in calling conventions aside, this approach was relatively flexible: for example, you could use a Fortran library in C or a C library in C++.

But if you write a library in a scripting language like Python, you can use that library only in Python. You cannot link your C file against the Python module, or import the module in another language like Ruby. If you want to re-use a Python library in another language, you have to invest in some more infrastructure.

The hard way would be to set up a language-agnostic interface to your Python code, for example by creating a web-service, or by using some form of protocol like CORBA.
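To give a rough idea of what the web-service route involves, here is a minimal sketch using only the Python standard library. Everything in it is an assumption made for illustration: the `predict` function is a hypothetical stand-in for some Python-only method, and the JSON message layout (`features` in, `label` out) and the `/predict` path are invented here, not part of any existing standard.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(features):
    """Hypothetical stand-in for a Python-only method: sign of the feature sum."""
    return 1 if sum(features) >= 0 else -1


class PredictHandler(BaseHTTPRequestHandler):
    """Answers POST requests carrying JSON like {"features": [1.0, 2.0]}."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        body = json.dumps({"label": predict(payload["features"])}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def start_server(port=0):
    """Start the service in a background thread; port 0 picks a free port."""
    server = HTTPServer(("127.0.0.1", port), PredictHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def call_service(port, features):
    """Client side; any language with HTTP and JSON support could do the same."""
    request = urllib.request.Request(
        "http://127.0.0.1:%d/predict" % port,
        data=json.dumps({"features": features}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    # Bypass any configured proxy, since we only talk to localhost.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(request) as response:
        return json.loads(response.read())["label"]


if __name__ == "__main__":
    server = start_server()
    print(call_service(server.server_address[1], [0.5, -0.2]))
    server.shutdown()
```

The client here happens to be Python as well, but the point is that a Ruby or C program only needs to speak HTTP and JSON to reach the same method. The price is the extra infrastructure: a running server, a message format to agree on, and serialization overhead on every call.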

The low cost version would be to settle on a common data format. Then, you can in principle combine methods from different environments by storing intermediate results in files. It won't be fast, but it will work.

To support this approach, we started a discussion some time ago, in which we settled on the ARFF format as a possible starting point. Furthermore, we have started to write and/or compile code for reading and writing ARFF files for a large number of programming languages, so that you do not have to implement the file format yourself.
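As an illustration of how little is needed for the basic case, here is a minimal sketch of an ARFF writer and reader in plain Python. It is not the code collected on this site, and it makes strong simplifying assumptions: only numeric and nominal attributes are supported, names and values contain no spaces, commas, or quotes, and missing values and the sparse format are not handled. The function names `write_arff` and `read_arff` are made up for this example.

```python
import io


def write_arff(stream, relation, attributes, rows):
    """Write a dense ARFF file.

    attributes is a list of (name, kind) pairs, where kind is either the
    string "numeric" or a list of allowed nominal values.
    """
    stream.write("@relation %s\n\n" % relation)
    for name, kind in attributes:
        if kind == "numeric":
            stream.write("@attribute %s numeric\n" % name)
        else:
            stream.write("@attribute %s {%s}\n" % (name, ",".join(kind)))
    stream.write("\n@data\n")
    for row in rows:
        stream.write(",".join(str(value) for value in row) + "\n")


def read_arff(stream):
    """Read a file written by write_arff; returns (relation, attributes, rows)."""
    relation, attributes, rows = None, [], []
    in_data = False
    for line in stream:
        line = line.strip()
        if not line or line.startswith("%"):  # skip blank lines and comments
            continue
        if in_data:
            values = line.split(",")
            rows.append([
                float(value) if kind == "numeric" else value
                for (_, kind), value in zip(attributes, values)
            ])
            continue
        keyword = line.split(None, 1)[0].lower()
        if keyword == "@relation":
            relation = line.split(None, 1)[1]
        elif keyword == "@attribute":
            _, name, kind = line.split(None, 2)
            if kind.startswith("{"):
                kind = [v.strip() for v in kind.strip("{}").split(",")]
            elif kind.lower() in ("numeric", "real", "integer"):
                kind = "numeric"
            else:
                raise ValueError("unsupported attribute type: %s" % kind)
            attributes.append((name, kind))
        elif keyword == "@data":
            in_data = True
    return relation, attributes, rows


if __name__ == "__main__":
    attrs = [("sepal_length", "numeric"), ("species", ["setosa", "versicolor"])]
    data = [[5.1, "setosa"], [7.0, "versicolor"]]
    buffer = io.StringIO()
    write_arff(buffer, "iris_subset", attrs, data)
    print(buffer.getvalue())
```

With something like this on both sides, a method written in one environment can dump its intermediate results to a file and a method in another environment can pick them up, which is exactly the slow-but-working pipeline described above.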

Django 1.0

August 6, 2008

The framework that mloss.org is based on, Django, is now approaching version 1.0. So far, we have been using the SVN version of Django.

So, of course we are planning to move to Django version 1.0 when it becomes available and, depending on how much time we have, maybe even track the betas. To all those silent users out there, please let us know if you find anything strange or wrong with mloss.org.