Re-implement or reuse?
Posted by Cheng Soon Ong on October 14, 2009
An implicit assumption of open source software is that having the source available encourages software reuse. I'd like to turn the question on its head: "Given a (machine learning) task, should I reuse available code or re-implement it?"
Standard software engineering practice Many introductory texts on software engineering teach code reuse as a good thing. This is captured in principles like DRY, which says that "Every piece of knowledge must have a single, unambiguous, authoritative representation within a system." However, DRY refers to software within a single project, whereas what I'm considering here is whether one should reuse code written by someone else (or written by yourself for another project). I think the principle still applies, but one has to be careful about what exactly is meant by "reuse". See the section "Study on OSS" below.
Antipattern (reinvent the wheel) In our JMLR position paper, we made two points that argue for software reuse: we get faster scientific progress through the reduced cost of re-implementing methods, and it is easier to combine parallel advances into one package. However, as this blog post eloquently puts it, we shouldn't re-implement the wheel, but reinventing the wheel may give us more suitable trade-offs for our particular problem.
Better than "hacked up" own version It is seldom tempting to re-implement an eigenvalue decomposition, and most people are quite happy to just use the one in LAPACK. This is because most of us believe that the version in LAPACK will be superior to anything we can write ourselves.
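As a concrete illustration of this kind of reuse, here is a minimal sketch using NumPy, whose `eigh` routine wraps LAPACK's symmetric eigensolvers rather than re-implementing them:

```python
import numpy as np

# A small symmetric test matrix. numpy.linalg.eigh dispatches to
# LAPACK's symmetric eigensolvers under the hood, so we get decades
# of numerical-stability work for free.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigenvalues, eigenvectors = np.linalg.eigh(A)
print(eigenvalues)  # ascending order: [1. 3.]

# Sanity check: A v = lambda v for the first eigenpair.
print(np.allclose(A @ eigenvectors[:, 0],
                  eigenvalues[0] * eigenvectors[:, 0]))  # True
```

Writing a comparably robust eigensolver from scratch would take months; calling the battle-tested one takes two lines.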
Save time by outsourcing maintenance If one reuses a well supported software package with an active development community, then one benefits from each update. Using LAPACK as an example again: it has evolved over the years, and most people take the numerical stability of its eigenvalue decomposition for granted. Ironically, there have been several re-implementations of the eigenvalue decomposition: LAPACK is based on the earlier EISPACK, which in turn is based on a package written by James Wilkinson, originally implemented in ALGOL. This brings us to the first reason to re-implement.
Existing solution is not good enough This, I think, is the reason most people re-implement something. For example, if a method exists but is not in your favourite programming language, it can often be a pain to use that code. Of course, there are ways to embed code written in another language within your own, or one can use a wrapping tool like SWIG. Other reasons you may not like the current solution are that it uses too much memory, takes too long, and so on.
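To show that crossing the language barrier need not mean re-implementing, here is a minimal sketch using Python's `ctypes` to call a routine from the system C math library directly (the library lookup via `find_library` assumes a Unix-like system; SWIG or similar tools generate this kind of glue for whole libraries):

```python
import ctypes
import ctypes.util

# Locate and load the C math library (e.g. libm.so.6 on Linux),
# instead of re-implementing sqrt in Python.
libm = ctypes.CDLL(ctypes.util.find_library("m"))

# ctypes defaults to int arguments/returns, so declare the real
# C signature: double sqrt(double).
libm.sqrt.restype = ctypes.c_double
libm.sqrt.argtypes = [ctypes.c_double]

print(libm.sqrt(2.0))  # ~1.4142135623730951
```

The point is that a thin wrapper over an existing, trusted implementation is often cheaper and safer than a fresh port.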
Educational purposes Many "simple" methods, for example boosting or k-means clustering, have probably been re-implemented many times: first, because they are "easy" to implement, and second, because they are often used as an initial exercise in machine learning for junior researchers. Going back to the reinventing-the-wheel principle, this blog gives plenty of reasons to re-implement something. In essence, you should reinvent the wheel if you want to learn about wheels. It reminds me of a comment Leon Bottou made in one of our NIPS workshops about the new implementation of Torch: "If these young people want to reimplement something, you should support it."
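For the flavour of such an exercise, here is a textbook sketch of k-means (Lloyd's algorithm) in plain Python, the kind of re-implementation one writes to learn about wheels, not a replacement for a tuned library version:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means: alternate assignment and centroid updates."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # naive initialisation
    for _ in range(iters):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2
                                      for a, b in zip(p, centers[i])))
            clusters[j].append(p)
        # Update step: move each center to the mean of its cluster.
        for j, cluster in enumerate(clusters):
            if cluster:
                centers[j] = tuple(sum(xs) / len(cluster)
                                   for xs in zip(*cluster))
    return centers

data = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9)]
print(sorted(kmeans(data, 2)))  # two centers, near (0.05, 0.1) and (5.05, 4.95)
```

Writing this teaches you about initialisation sensitivity and empty clusters; using a library teaches you neither, which is precisely the trade-off under discussion.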
Not aware of other solution This ties in with the next point about how much effort it takes to find an existing solution versus the effort of implementing it yourself. In the study cited below, the authors argue that if an existing solution were easy to find, for example through good search tools and powerful indexing (e.g. mloss ;-)), one would be more likely to reuse software. Apparently, many software corporations have programs that encourage reuse from both sides of the equation: making it easier to find relevant code, and enforcing designs such that existing software is reusable.
Existing interface changes too often This has happened to me personally a few times: depending on a software package that has not really matured yet, and having to spend time rewriting my own code to track changes in the API of another software library. The argument above for reusing software so that we can outsource maintenance is a double-edged sword: it also means that you may have to track other projects.
Study on OSS
The following study crystallized some of the ideas that I have. Incidentally, it was done by a bunch of people down the road from where I work. I may have to drop by to have a chat with them at some point.
Stefan Haefliger, Georg von Krogh, Sebastian Spaeth, "Code Reuse in Open Source Software", Management Science, vol. 54. no. 1, pages 180-193, 2008
They have many interesting empirical findings based on an in-depth study of xfce4, TikiWiki, AbiWord, GNUnet, iRATE, and OpenSSL. I'm just pulling out some interesting tidbits:
Knowledge reuse vs. software reuse One needs to separate the act of copying bits of code from the act of reading someone else's code and learning something from it. I would argue that knowledge reuse is probably what we really want in machine learning, so even if you think your software is not the cleanest or most efficient implementation, you should still make it open source (and put it on mloss) so that someone else can learn from it. Sadly, as the paper points out, knowledge reuse is really hard to measure.
Reuse of lines of code vs. reuse of software components In the study, only a very small proportion (less than 1%) of the 6 million lines of code were copied and accredited. So even though in principle one can copy bits of code, developers rarely copy code from other projects. In contrast, all the projects reused external software components. Here, the authors detected component reuse by effectively looking for "#include" statements referencing external projects, and this turns out to be the dominant form of reuse among open source developers. Such component reuse includes the reuse of methods and algorithms from other tools. Basically, developers prefer to write "interesting" code, and reuse software to plug the gaps for the less interesting parts of the pipeline.
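The study's detection idea can be sketched in a few lines. This is a rough approximation (my own, not the authors' actual methodology): scan C source for `#include <...>` directives, which conventionally name system or external headers, as opposed to `#include "..."` for a project's own files:

```python
import re

# Match '#include <header>' directives; angle brackets conventionally
# indicate external/system headers, quotes indicate local ones.
INCLUDE_RE = re.compile(r'^\s*#\s*include\s*<([^>]+)>', re.MULTILINE)

source = """
#include <stdio.h>
#include <openssl/ssl.h>
#include "local_util.h"
"""

external = INCLUDE_RE.findall(source)
print(external)  # ['stdio.h', 'openssl/ssl.h']
```

Run over a whole source tree, counts of such directives give a crude but scalable signal of which external components a project depends on.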
It makes sense to me that it would be easier to "link" to a particular software component than to encapsulate it in your project (this is probably obvious to those C/C++ programmers who link all the time). Therefore my advice to those publishing software is to think carefully about the interfaces, and to document them clearly. Standard software engineering ideas like modular design will increase the chances that some other project reuses yours.
I probably missed out lots of stuff in the lists above, and would like to hear your thoughts...