Open Thoughts

How do you structure your code?

Posted by Cheng Soon Ong on March 28, 2010

I am currently refactoring some small bits of research code that I've written, and like many others before me, I've come to the conclusion that some sort of toolbox structure is appropriate for my project. Subscribing to the unix philosophy of writing small bits of code that talk to each other, I tried to see how this would apply to a typical machine learning project.

My interest lies in algorithms and I tend to work with discriminative supervised learning methods, so perhaps my design choices are biased by this. I'd be very happy to hear what other people do with their projects. I believe that there should be three types of toolboxes:

  • Data handling - including file format handling, feature creation and preprocessing, normalization, etc.
  • Learning objectives - which define the mathematical objects that we are searching through, for example hinge loss versus logistic loss, l1 versus l2 regularization. I merge kernels into this part, instead of data handling, because they really depend on the type of learning algorithm (a code sketch of this split follows the list).
  • Numerical tools - such as convex optimization or stochastic gradient descent.
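
All names and signatures below are hypothetical; the point is only where the boundaries between the three toolboxes would sit:

    # data handling: file formats, features, preprocessing -- knows nothing about learning
    import numpy as np

    def load_dataset(path):
        """Read a whitespace-separated file into a feature matrix X and label vector y."""
        raw = np.loadtxt(path)
        return raw[:, :-1], raw[:, -1]

    def normalize(X):
        """Scale every feature to zero mean and unit variance."""
        return (X - X.mean(axis=0)) / X.std(axis=0)

    # learning objectives: the mathematical object being minimized
    def hinge_loss(w, X, y, reg=1.0):
        """L2-regularized hinge loss and a subgradient; y contains +1/-1 labels."""
        margins = 1.0 - y * np.dot(X, w)
        active = margins > 0
        loss = margins[active].sum() + 0.5 * reg * np.dot(w, w)
        grad = -np.dot(X[active].T, y[active]) + reg * w
        return loss, grad

    # numerical tools: a generic optimizer that knows nothing about data or losses
    def gradient_descent(objective, w0, step=0.01, iterations=100):
        """Minimize any objective that returns (value, gradient)."""
        w = w0
        for _ in range(iterations):
            _, grad = objective(w)
            w = w - step * grad
        return w

An experimental script would then only glue these pieces together, for example w = gradient_descent(lambda w: hinge_loss(w, X, y), np.zeros(X.shape[1])) after loading and normalizing the data.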

On top of that, in the interests of reproducible research, there should be an "experimental scripts" directory for each paper that shows how to get from the raw data, using the toolboxes above (and their versions), to the plots and tables in that particular paper.

Most projects tend to be monolithic, i.e. they merge all three types of toolboxes into one project. I believe this is due to our culture of writing a piece of code for a particular paper deadline, which effectively produces a bunch of experimental scripts that include all the code for data handling, mathematical objects and optimization. Often the argument is that this is the only way to make the code efficient, but it also means that code has to be rewritten time and again for basic things such as computing the ROC of a classifier, doing trace normalization of a kernel matrix, or doing "simple gradient descent". For such "easy" things, it may actually be less overhead to just recode them in your own framework, but for potentially more difficult things, such as using CUDA, it would be convenient if the numerical tools library took care of it once and for all.
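
To illustrate just how small those "easy" things are, and therefore how often they end up rewritten, here is a rough numpy sketch of two of them; the function names and the exact normalization convention are my own choices:

    import numpy as np

    def trace_normalize(K):
        """Scale a kernel matrix so that its trace equals the number of samples."""
        return K * (K.shape[0] / np.trace(K))

    def roc_curve(scores, labels):
        """False/true positive rates of a binary classifier (+1/-1 labels) as the
        decision threshold sweeps over the sorted scores; assumes both classes occur."""
        order = np.argsort(scores)[::-1]
        labels = np.asarray(labels)[order]
        tp = np.cumsum(labels == 1)
        fp = np.cumsum(labels == -1)
        return fp / float(fp[-1]), tp / float(tp[-1])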

My current project design (in python) is also monolithic, but I intend to have separate packages for data, classifiers and optimization, corresponding to the three items above. Experimental scripts for reproducible research are not part of the project, but part of the paper, since I do not want to think about backward compatibility. I mean, should new versions of my code still reproduce old results, or should results be tied to a particular version of the project? I'm also using the project structure recommended by this post and this post.
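
One possible layout along these lines (the directory names are purely illustrative, not a prescription):

    project/
        data/            # file formats, feature creation, preprocessing, normalization
        classifiers/     # losses, regularizers, kernels
        optimization/    # convex solvers, stochastic gradient descent
        setup.py
    paper/               # lives with the paper, not with the project
        scripts/         # experimental scripts, pinned to a tagged version of project/
        figures/
        tables/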

Any tips from more experienced readers are most welcome! Especially on how to keep the code base flexible for future research projects.

Comments

Davis King (on March 28, 2010, 16:18:06)

I would try to apply contract programming to the routines in your toolbox. That is, basically, write down what the requirements are for calling each routine and also what each routine accomplishes. It's something that is usually easy to do and pays off enormously in the long run. I say "usually" because sometimes it's hard to describe what something does. But if you can't describe how to use part of your toolbox to another human then that's probably a sign that this part is overly complex, not very flexible, and in need of refactoring. So contract programming has the nice side effect of helping to identify these problems.
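
A minimal sketch of what such a contract could look like in Python, with the requirements checked as preconditions and the promise checked as a postcondition (the function and its checks are made up for the example):

    import numpy as np

    def train_classifier(X, y, regularization=1.0):
        """Train a linear classifier.

        Requires: X is a 2D sample-by-feature array, y holds one +1/-1 label
        per row of X, and regularization is positive.
        Ensures: the returned weight vector has one entry per feature of X.
        """
        # preconditions: fail loudly if the caller breaks the contract
        assert X.ndim == 2, "X must be a 2D sample-by-feature array"
        assert len(y) == X.shape[0], "need exactly one label per sample"
        assert set(np.unique(y)) <= {-1, 1}, "labels must be +1 or -1"
        assert regularization > 0, "regularization must be positive"

        w = np.zeros(X.shape[1])  # stand-in for the actual training procedure

        # postcondition: what the routine promises its callers
        assert w.shape == (X.shape[1],)
        return w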

I use it everywhere in dlib and I'm quite confident that I wouldn't have been able to manage this project without contract programming. It surely would have turned into a huge mess a long time ago :)

That's my 2 cents anyway.

Mike Gashler (on April 16, 2010, 17:10:53)

I think it is monolithic apps, not monolithic libraries that violate the unix philosophy. Big toolboxes can still be well-organized, especially with proper use of namespaces. Besides, the linker will only pull in code that you actually use, not the whole toolbox, so big libraries don't mean bloat. (I'm more of a C/C++ person, but I assume Python has this property in common.) You can have lots of little apps built around a big library.

I think it's good to have a clear separation between general-purpose code and app-specific code. I keep all my general-purpose code well-polished, and just quickly throw together little apps as necessary for my research.

For interfaces, I advise having no more required parameters than strictly necessary. When writing the code, the parameters always seem so intuitive, but later you'll wish you didn't have to think about them. Thus, almost everything should have a default value. You can still make your tools arbitrarily flexible with optional calls to configure the various settings and parameters.
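
A small Python illustration of this advice (the class and its parameters are invented for the example): require only what the caller cannot do without, and push everything else behind defaults and optional configuration calls.

    class SGDTrainer:
        """Stochastic gradient descent with only the essentials required up front."""

        def __init__(self, objective):       # the one thing a caller must supply
            self.objective = objective
            self.step_size = 0.01             # everything else gets a sensible default
            self.iterations = 1000

        def set_step_size(self, step_size):   # optional configuration, only when needed
            self.step_size = step_size
            return self

        def train(self, w0):
            w = w0
            for _ in range(self.iterations):
                _, grad = self.objective(w)
                w = w - self.step_size * grad
            return w

    # typical use needs no extra parameters at all:
    #     w = SGDTrainer(my_objective).train(w0)
    # but the knobs are still there when you care:
    #     w = SGDTrainer(my_objective).set_step_size(0.001).train(w0)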
