Open Thoughts

Conflicting OSS goals

Posted by Tom Fawcett on May 12, 2010

It occurred to me while reviewing that the goals of OSS contributors and users are quite varied. Often these goals are in conflict. For example, here are a few ways of classifying packages I've noticed:

  • library with APIs vs complete package (end-to-end). Some packages are libraries with comprehensive APIs and are meant to be used as components in larger systems (or at least they assume the larger system will handle IO, evaluation, sampling, statistics, etc.). Other packages accommodate reading from standard formats (e.g., CSV, ARFF) and handle evaluation and other aspects of experimentation.

  • packages that produce intelligible models (trees, rules, visualizations) vs packages that produce black-box models. Some experimenters want/demand to understand the model, and a black-box "bag of vectors" won't work no matter how good the predictions.

  • flexible, understandable code vs efficient code. Some packages are written to be clean and extensible, while others are written to be efficient and fast. (Of course, some packages are neither :-)

  • single system vs platform for many algorithms. While some researchers contribute single algorithm implementations, there is a clear trend toward large systems (Weka, Orange, scikit.learn, etc.) which are intended to be platforms for families or large collections of algorithms.
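The first distinction above can be sketched in miniature. The names and the toy classifier below are purely illustrative, not from any real package: a library-style component exposes train/predict functions and assumes the caller handles IO and evaluation, while an end-to-end package owns the whole pipeline from file format to reported error rate.

```python
import csv
import io

# --- Library style: caller is responsible for IO, evaluation, sampling. ---

def train_stump(X, y):
    """Fit a trivial one-feature threshold classifier (illustrative only).

    Picks the threshold on feature 0 that maximizes training accuracy.
    """
    best_thresh, best_acc = None, -1.0
    for row in X:
        t = row[0]
        preds = [1 if x[0] >= t else 0 for x in X]
        acc = sum(p == yi for p, yi in zip(preds, y)) / len(y)
        if acc > best_acc:
            best_thresh, best_acc = t, acc
    return best_thresh

def predict(thresh, X):
    """Apply the fitted threshold to new examples."""
    return [1 if x[0] >= thresh else 0 for x in X]

# --- End-to-end style: the package owns parsing, training, and reporting. ---

def run_experiment(csv_text):
    """Read CSV rows of features,label; train; print a training error rate."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    X = [[float(v) for v in r[:-1]] for r in rows]
    y = [int(r[-1]) for r in rows]
    thresh = train_stump(X, y)
    errors = sum(p != yi for p, yi in zip(predict(thresh, X), y))
    print(f"training error rate: {errors / len(y):.2f}")
```

A researcher composing a larger experiment would call `train_stump`/`predict` directly; a practitioner with a CSV file only needs `run_experiment` — which is exactly why the two designs pull in different directions.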

In turn, a lot of this depends on whether the user is a researcher who wants to experiment with algorithms or a practitioner who wants to solve a real problem. Packages written for one goal are often useless for another. A program designed for several thousand examples that just outputs final error rates won’t help a practitioner who wants to classify a hundred thousand cases; a package with an interactive interface is very cumbersome for someone who needs to report extensive cross-validation experiments.

It’s clear from the JMLR OSS Review Criteria (http://jmlr.csail.mit.edu/mloss/mloss-info.html) that JMLR hasn’t thought about the wide variety of software issues. So I suggest that the mloss.org organizers (and contributors) start to think of useful categories for their code that can help people understand and navigate this space.

Comments

Soeren Sonnenburg (on May 13, 2010, 09:16:55)

While I agree that these goals are conflicting from a design perspective, I feel that all of these realizations can peacefully coexist in the mloss ecosystem :)

I mean, for example, a single algorithm (if popular and well implemented) will likely be sucked in by the major toolboxes: a good example here is libsvm. I guess every major toolbox has it (at least Weka, Shogun, Orange, ...).

And even though core algorithms are hidden deep in the code, they will still be extracted / modified to fit into other systems.

Same with flexible, understandable vs. unreadable but fast code: depending on the focus of the toolbox, its authors suck in different parts of the code.

And yes, jmlr-mloss wants them all. I guess the only thing to regularize here is that we want clean, library-like interfaces for single-algorithm libraries, and bigger frameworks that are at least able to read some common data format(s).

Tom Fawcett (on May 14, 2010, 02:53:29)

I think we're talking past each other. Here are two concrete problems:

  1. The mloss.org index categories could be better. They miss the issues I listed, which are salient details for people trying to find code.

  2. JMLR and the MLOSS workshop don't yet understand the variety of OSS contributions. This is evident from their review forms. Based on submissions I saw, authors/submitters don't understand either.

Soeren Sonnenburg (on May 17, 2010, 18:27:54)

I see. So the question is how to achieve 1 and clear up 2? And I would like to add a 3: who is willing to help us with this?

Tom Fawcett (on May 21, 2010, 02:50:47)

Issue 2 is the one that was most problematic when reviewing for JMLR and the MLOSS workshop. While mloss.org may want all ML OSS contributions, I really don't think JMLR wants to publish most of them. Deciding publication criteria isn't something volunteers can help with. I think the JMLR editor(s) in charge of the OSS track need to give more thought to what they're looking for and clarify the CFP.
