March 2010 archive
March 28, 2010
I am currently refactoring small bits of research code that I've written, and like many others before me, I've come to the conclusion that some sort of toolbox structure is appropriate for my project. Subscribing to the Unix philosophy of writing small programs that talk to each other, I tried to see how this would apply to a typical machine learning project.
My interest lies in algorithms and I tend to work with discriminative supervised learning methods, so perhaps my design choices are biased by this. I'd be very happy to hear what other people do with their projects. I believe that there should be three types of toolboxes:
- Data handling - including file format handling, feature creation and preprocessing, normalization, etc.
- Learning objectives - which define the mathematical objects we are searching through, for example hinge loss versus logistic loss, or l1 versus l2 regularization. I put kernels in this part, rather than in data handling, because the choice of kernel really depends on the type of learning algorithm.
- Numerical tools - such as convex optimization or stochastic gradient descent.
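To make the separation between the second and third toolboxes concrete, here is a minimal sketch in Python (all names and interfaces are my own, not from any existing library): losses and regularizers supply gradients, and a generic optimizer consumes them without knowing anything about data handling or the particular objective.

```python
import numpy as np

def hinge_grad(w, X, y):
    """Subgradient of the hinge loss sum_i max(0, 1 - y_i <w, x_i>)."""
    active = y * (X @ w) < 1          # examples violating the margin
    return -(X[active].T @ y[active])

def logistic_grad(w, X, y):
    """Gradient of the logistic loss sum_i log(1 + exp(-y_i <w, x_i>))."""
    z = y * (X @ w)
    return -(X.T @ (y / (1.0 + np.exp(z))))

def l2_grad(w, lam):
    """Gradient of the l2 regularizer (lam / 2) * ||w||^2."""
    return lam * w

def gradient_descent(grad, w0, step=0.1, n_iter=200):
    """A generic numerical tool: it only sees a gradient oracle."""
    w = np.asarray(w0, dtype=float).copy()
    for _ in range(n_iter):
        w -= step * grad(w)
    return w
```

Swapping hinge loss for logistic loss, or one regularizer for another, then amounts to composing different gradient functions before handing them to the optimizer, rather than rewriting the optimization loop.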
On top of that, in the interest of reproducible research, each paper should have an "experimental scripts" directory that shows how to go from the raw data, using the toolboxes above (with their versions recorded), to the plots and tables in that particular paper.
Most projects tend to be monolithic, i.e. they merge all three types of toolboxes into one project. I believe this is due to our culture of writing code for a particular paper deadline, which effectively leaves behind a bunch of experimental scripts containing all the code for data handling, mathematical objects and optimization. Often the argument is that this is the only way to make the code efficient, but it also means that code has to be rewritten time and again for basic things such as computing the ROC of a classifier, doing trace normalization of a kernel matrix, or doing "simple gradient descent". For such "easy" things, it may actually be less overhead to just recode them in your own framework, but for potentially more difficult things, such as using CUDA, it would be convenient if the numerical tools library took care of it once and for all.
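Two of the "easy" utilities mentioned above each fit in a few lines of Python; a quick sketch (function names are mine), which is exactly why they keep getting rewritten instead of living in a shared toolbox:

```python
import numpy as np

def trace_normalize(K):
    """Scale a kernel matrix so that its trace equals its dimension."""
    return K * (K.shape[0] / np.trace(K))

def roc_auc(scores, labels):
    """Area under the ROC curve via the rank-sum statistic.
    `labels` holds 1 for positives, anything else for negatives.
    (Tied scores are broken arbitrarily here.)"""
    scores, labels = np.asarray(scores), np.asarray(labels)
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = labels == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```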
My current project design (in Python) is also monolithic, but I intend to have separate packages for data, classifiers and optimization, corresponding to the three items above. Experimental scripts for reproducible research are not part of the project but part of the paper, since I do not want to think about backward compatibility: should new versions of my code still reproduce old results, or should results be tied to a particular project version? I'm also using the project structure recommended by this post and this post.
Any tips from more experienced readers are most welcome! Especially on how to keep the code base flexible for future research projects.
March 22, 2010
I just stumbled across this blog entry, which I found an interesting read.
Quoting the first paragraphs from the source above:
Now it’s well known and generally agreed that you can’t cite Wikipedia for a scientific paper or other serious academic work. This makes sense firstly because Wikipedia changes, both in the short term (including vandalism) and in the long term (due to changes in technology, new archaeological discoveries, current events, etc). But you can link to a particular version of a Wikipedia page, you can just click on the history tab at the top of the screen and then click on the date of the version for which you want a direct permanent link.
The real reason for not linking to Wikipedia articles in academic publications is that you want to reference the original research not a report on it, which really makes sense. Of course the down-side is that you might reference some data that is in the middle of a 100 page report, in which case you might have to mention the page number as well. Also often the summary of the data you desire simply isn’t available anywhere else, someone might for example take some facts from 10 different pages of a government document and summarise them neatly in a single paragraph on Wikipedia. This isn’t a huge obstacle but just takes more time to create your own summary with references.
March 9, 2010
I recently came across a post on O'Reilly Radar about Truly Open Data, which discusses how concepts from open source software can be translated to open data. Basically, apart from just "getting the data out there", we need software tools for managing it. I summarize the author's list of tools below, with some thoughts on how each may apply to machine learning data.
- diff and patch - Perhaps we need something like md5sum for binary data? It seems that most machine learners don't actually use "live" data very often, so perhaps these tools are less needed for us?
- version control
- releases - An obvious release point would be upon submission of a paper. One downside I realized about double-blind reviewing is that one cannot release new data (or software) upon submission. Some things are just easier to do with some real bits.
- documentation - Aside from bioinformatics data that I generated myself, I'd be hard-pressed to name one dataset (apart from iris) whose provenance I actually know.
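On the diff-and-patch point above: a content fingerprint for a binary dataset takes only a few lines with Python's standard hashlib (the function name is my own), reading the file in chunks so that large datasets don't need to fit in memory.

```python
import hashlib

def dataset_fingerprint(path, chunk_size=1 << 20):
    """Hash a (possibly large) binary data file in chunks, so a dataset
    can be referenced by content rather than by a mutable filename."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

A paper's experimental scripts could then record this digest next to the dataset's location, pinning down exactly which version of the data produced the published results.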