Open Thoughts

April 2009 archive

What is an "easy to build" system?

April 23, 2009

One thing that reviewers of submissions to the open source track of JMLR often complain about is that the submitted software doesn't build. On discovering this, some reviewers refuse to look at the rest of the submission. I agree that being able to compile a piece of code is quite an important part of the total score, but it should not be the whole story. In fact, the review criteria for JMLR (OSS track) specifically list other important criteria. Being easy to compile would fall under "good user documentation", since it is the end user who benefits from an easy to build system. But in general, a reviewer who is unable to build the submission tends to write a negatively biased review. Even worse, he or she may not even consider other parts of the software project.

So, why do reviewers have so much trouble compiling software? The answer is quite complicated, and I would like to try to scratch the surface of this highly charged issue. More in-depth recommendations for open source projects can be found, for example, in Karl Fogel's online book or Eric Steven Raymond's detailed howto. I restrict this post to Linux style "download, unzip and build" type software, ignoring GUI type "double click" installations, such as .dmg packages or .exe installers.

Documentation, documentation, documentation

A number of the compilation issues would be solved if there were clear and precise documentation, and if the user read it. One JMLR submission had two reviewers who could not build the system but a third who commented on how smoothly everything went. It turned out that the author had written in his cover letter that the submitted code was not complete due to file size restrictions on the JMLR website, and that reviewers were supposed to get the complete code online.

Apart from documentation for the user to understand what the project does and documentation for the developer on how to extend the project, there are the installation instructions. These should cover how to install, how to upgrade from previous versions, and what dependencies are required. For Linux there are some conventions about how to structure things. If possible, one should stick to one of the standard idioms for compiling software (see the next section). As an aside, Google recently released their software update system.

The build system

The traditional build pipeline is the "configure; make;" system which is popular among C projects such as GNU projects. For Python projects there is the setup.py idiom or easy_install. I am not a Java expert, but there seems to be a plethora of build tools available. At the top of my ease of installation list comes the R community, which has agreed on a single distribution channel. There also seem to be a few up and coming build systems such as cmake, scons, waf and jam. If one uses too exotic a build system, the reviewers probably won't have it on their box and will first have to obtain the build system itself. However, one would often like the nicer features provided by the newer systems. Further, JMLR reviewers are often not experts in the language that the project is written in and are not familiar with its standard idioms (but this can be fixed by good documentation). It is a tough call...

One thing I've found quite nice is when projects have instructions on how to check that your build has completed successfully. For machine learning software, this can be a small example on toy data which allows the user to confirm that things are working as they should be.
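As a rough illustration of such a check, here is a minimal sketch of a post-install smoke test in Python. The function fit_line is a hypothetical stand-in for whatever routine a real package would export, and the toy data and tolerance are made up for this example; in a real project the test would import the installed package instead of defining the function itself.

```python
"""Post-install smoke test: fit a line to toy data with a known answer."""
import sys


def fit_line(xs, ys):
    """Hypothetical stand-in for a routine exported by the installed package.

    Returns (slope, intercept) of the ordinary least-squares line.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def smoke_test(tol=1e-9):
    """Return True if the fit recovers the line the toy data was drawn from."""
    xs = [0.0, 1.0, 2.0, 3.0]
    ys = [1.0, 3.0, 5.0, 7.0]  # toy data lying exactly on y = 2x + 1
    slope, intercept = fit_line(xs, ys)
    return abs(slope - 2.0) < tol and abs(intercept - 1.0) < tol


if __name__ == "__main__":
    ok = smoke_test()
    print("build OK" if ok else "build FAILED")
    sys.exit(0 if ok else 1)  # non-zero exit status signals a broken build
```

Because the script exits with a non-zero status on failure, it can be called as the last step of the build instructions, and a reviewer gets a clear yes or no answer.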

Dependencies

Dependencies are a double edged sword. On one hand, one would like to take advantage of highly optimized libraries such as BLAS, LAPACK, Boost or the GNU Scientific Library. On the other hand, this often means that you have to track changes in the dependencies, and the user may not have the dependencies available. We had one JMLR submission which used a combination of Python and C++. One reviewer had a terrible time trying to get it working, first because he was not familiar with Python dependencies and second because his Linux distribution provides the Python headers in a separate package (and he didn't know).
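A project can take some of the guesswork out of this by shipping a small script that reports what is missing before the build starts. The following is only a sketch using the Python standard library: the module names in REQUIRED are placeholders rather than the dependencies of any particular submission, and the header check simply looks for Python.h in the include directory reported by the interpreter.

```python
"""Report missing Python modules and missing Python C headers before building."""
import importlib.util
import os
import sysconfig

# Placeholder list: replace with the top-level modules your project imports.
REQUIRED = ["numpy", "scipy"]


def missing_modules(names):
    """Return the top-level module names that cannot be found on this system."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def python_headers_present():
    """Return True if Python.h exists where this interpreter expects it.

    Many Linux distributions ship the headers in a separate development
    package, so this can be False even though Python itself works.
    """
    include_dir = sysconfig.get_paths()["include"]
    return os.path.isfile(os.path.join(include_dir, "Python.h"))


if __name__ == "__main__":
    for name in missing_modules(REQUIRED):
        print("missing Python module:", name)
    if not python_headers_present():
        print("Python.h not found; install your distribution's Python development package")
```

Run before the build, a script like this turns an obscure compiler error about a missing header into a message that tells the user what to install.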

Conclusion

There are all sorts of strange things that can happen while a user is trying to install your software. Try to follow one of the common idioms for your language, so that the user feels comfortable with the build. But at the end of the day, nothing beats real life testing. So, list your software on mloss.org before you try to submit to JMLR. It may just allow you to catch some installation bugs before they upset your reviewers.

Who is allowed to list software?

April 2, 2009

We have been having quite a heated discussion among the organisers of mloss.org about whether we should encourage people to submit open source projects which are not their own.

We wanted (and still want) to follow the model of freshmeat, where we are not actually hosting any projects at all, but really just providing links to project homepages. However, since a user can upload a tar or zip archive, in principle he or she can actually do some very "bare bones" hosting just on mloss.org. This is in contrast to a sourceforge or googlecode style project, which gives you all the infrastructure necessary to host an open source project.

So our framework actually allows anybody to submit an mloss project, not only the authors of a package. However, as far as we can tell, only authors have submitted (their own) projects. The question is: why hasn't anyone submitted stuff that isn't their own? Are people afraid of the competition?

On the other hand, if we start encouraging people to list software they find, will there be problems with quality? Will the original authors of the projects be upset?

MLOSS progress updates

April 2, 2009

Visitor statistics

When looking at our access statistics, we see this very nice periodic curve with peak accesses on Tuesday, at around 200 users per day. The slowest days? The weekends, with around 100 users per day. So, it seems that people come back to work on Monday and get their weekly fix of mloss.org. The peak falls on Tuesday because our site is on CET and many of our accesses come from across the Atlantic. The USA tops the list by number of visitors, but people from Croatia and Denmark look at the most pages (more than 6 on average).

In the last month, we've had 172 (123 unique) visitors to this blog, which is more than I thought it would be. I kind of expected that people would use mloss.org as a place to find some software and to update what they have. But it seems that some people actually read this blog. :-) However, it is clear that most people just come for the software. It is quite hard to tell exactly how many of our visitors are actually real people, and how many are just web crawlers. A rough guess is that at least half the visitors to mloss.org are machines, since they spend less than 10 seconds on the site.
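The rough guess above boils down to a simple threshold rule on visit duration. Here is a sketch of that rule in Python; the durations in the usage example are invented for illustration and are not taken from our logs, and a real analysis would also look at things like user agent strings.

```python
def crawler_fraction(durations_seconds, threshold=10.0):
    """Fraction of visits shorter than `threshold` seconds.

    Short visits are treated as probable crawlers, which is only a heuristic:
    an impatient human also counts as a machine under this rule.
    """
    if not durations_seconds:
        return 0.0
    short = sum(1 for d in durations_seconds if d < threshold)
    return short / len(durations_seconds)


if __name__ == "__main__":
    # Made-up visit durations in seconds, purely for illustration.
    toy_visits = [2, 5, 30, 120]
    print(crawler_fraction(toy_visits))
```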

Machine Learning Data

The discussion about a format for machine learning data seems to have ground to a halt. Do machine learners really not care about exchanging data automatically? Let us know your thoughts!