Google Summer of Code 2012
by Cheng Soon Ong on April 24, 2012 (0 comments)
The list of Google Summer of Code (GSoC) students for 2012 has been announced.
For young programmers, it is probably the easiest way to get your foot in the door by showing that you can contribute to something worthwhile. For open source projects, it is an injection of fresh blood. For academics looking for programmer types, it is a good way to differentiate between all the applicants with top marks from universities which you personally do not know.
Among the mentoring organisations which may be of interest to the machine learning community:
- Battle for Wesnoth with 5 students
- cgal with 4 students
- CMU Sphinx with 6 students
- DBpedia spotlight with 4 students
- Genome Informatics with 2 students
- Gephi consortium with 5 students
- Hedgewars project with 5 students
- National Resource for Network Biology with 16 students
- Open Bioinformatics Foundation with 5 students
- OpenCV with 12 students
- OpenCog foundation with 5 students
- Orange with 5 students
- shogun with 8 students, and I'm mentoring here.
- SimpleCV with 4 students
A warm welcome to everyone!
Open Access is very cheap
by Cheng Soon Ong on March 8, 2012 (0 comments)
Stuart Shieber just posted very convincing evidence that publication does not really cost that much, at least in technically savvy fields. Read about it here...
"Adding it all up, a reasonable imputed estimate for JMLR’s total direct costs other than the volunteered labor (that is, tax accountant, web hosting, domain names, clerical work, etc.) is less than $10,000, covering the almost 1,000 articles the journal has published since its founding — about $10 per article. With regard to whose understanding of JMLR’s financing is better than whose, Yann LeCun I think comes out on top. How do I know all this about JMLR? Because (full disclosure alert) I am [ed. the publisher]."
This shows that Yann LeCun knew what he was talking about in the argument with Kent Anderson in the comments section of the Scholarly Kitchen.
mloss.org does not cost that much either. It was initially hosted by the Friedrich Miescher Laboratory in Tuebingen, and is now hosted by the Technical University Berlin. All coding was done by Soeren, Mikio and me during our free time. mldata.org cost a bit more because we paid a programmer and an intern for a few months at the start, and we also bought a more serious server for the heavier load. Luckily we have a PASCAL2 grant.
The real cost, as with JMLR, is the volunteer time needed. In fact, the mloss/mldata team is stretched pretty thin at the moment, and any help would be most welcome. Please contact us if you have a few free hours!
Did the MathWorks Infringe the Competition Laws?
by Soeren Sonnenburg on March 2, 2012 (0 comments)
I have just read that the EU commission is investigating whether The MathWorks infringed EU competition laws in connection with its software Matlab and Simulink. An unnamed competitor complained to the EU commission, claiming that The MathWorks refused to provide it with a license for Matlab/Simulink. This hinders making the competing product interoperable and makes it impossible for the competitor to perform (rightful!) reverse engineering in that case.
The original source is here http://europa.eu/rapid/pressReleasesAction.do?reference=IP/12/208&format=HTML&aged=0&language=EN&guiLanguage=en
Nature Editorial about Open Science
by Cheng Soon Ong on February 28, 2012 (0 comments)
The case for open computer programs
Does open source software imply reproducible research?
There was a recent Nature editorial expounding the need for open source software in scientific endeavors. It argues that many modern scientific results depend on complex computations and hence source code is needed for scientific reproducibility. It is nice that a high profile journal has published articles promoting open source software, since it increases visibility. However, some more careful thought is required, as the message of the article is inaccurate in both directions.
Open source provides more benefits than just reproducibility
Actually, open source provides more than is necessary for reproducibility, since the licenses provide the ability to edit and extend the code, as well as preventing discriminatory practices. To be pedantic, for reproducibility, any software (even a compiled executable) would work.
We've said this before but the message is worth repeating. Open source provides:
- reproducibility of scientific results and fair comparison of algorithms;
- uncovering problems;
- building on existing resources (rather than re-implementing them);
- access to scientific tools without cease;
- combination of advances;
- faster adoption of methods in different disciplines and in industry; and
- collaborative emergence of standards.
See our paper for more details.
Having source code does not imply reproducibility
As the editorial observes in the final sentence of the abstract, "The vagaries of hardware, software and natural language will always ensure that exact reproducibility remains uncertain, ... ". I've personally spent many frustrating hours trying to get somebody's research code to compile. In fact, one of the most common complaints by reviewers of JMLR's open source track is that they are unable to get submissions to work on their computer. The multitude of computing environments, numerical libraries and programming languages means that very often, the user of the software is in a different frame of mind compared to the authors of the source code. My advice to fledgling authors of machine learning open source software is to provide a "quickstart" tutorial in the README, because everybody is impatient, and nobody will look into fixing your bugs before they are convinced that your code will do something useful for them. And yes, fixing $PATH can be tricky if you don't know exactly how to do it.
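For the $PATH problem mentioned above, a quickstart could ship with a tiny self-check along these lines. This is only a sketch under assumptions: the executable name `mytool` is made up for illustration, and the helper names `find_on_path` and `describe_path` are hypothetical, not part of any mloss.org or JMLR project. It uses only the Python standard library.

```python
import os
import shutil


def find_on_path(program, path=None):
    """Return the full path of `program` if it is found on the search path, else None.

    `path` is a PATH-style string; when None, the current $PATH is used.
    """
    return shutil.which(program, path=path)


def describe_path(path=None):
    """Split a PATH-style string into (directory, exists) pairs, skipping empty entries."""
    raw = os.environ.get("PATH", "") if path is None else path
    return [(d, os.path.isdir(d)) for d in raw.split(os.pathsep) if d]


if __name__ == "__main__":
    tool = "mytool"  # hypothetical executable name, replace with your own
    location = find_on_path(tool)
    if location is None:
        print(f"'{tool}' was not found. Directories currently searched:")
        for directory, exists in describe_path():
            note = "" if exists else "  (does not exist)"
            print(f"  {directory}{note}")
    else:
        print(f"'{tool}' found at {location}")
```

Printing the directories that were actually searched, and flagging the ones that do not exist, tends to be more helpful to an impatient user than a bare "command not found".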
I guess the bottom line is quite an obvious statement: Good open source software will give you reproducibility and a few other additional benefits.
Tagging Project 'Published in JMLR'
by Soeren Sonnenburg on February 3, 2012 (0 comments)
I did some minor maintenance work today updating email addresses of certain projects and merging libmlpack and MLPACK.
More importantly, I added the JMLR MLOSS URL to all mloss.org listed projects that have a corresponding JMLR publication. Since there seemed to be quite some backlog, it might be worthwhile to shed some light on how one gets flagged 'published in jmlr'.
Naturally, the condition is a corresponding JMLR MLOSS paper that is accepted and already online. In addition, you should remind your JMLR action editor to flag the project 'published in jmlr' by giving him the link to the abstract of your publication on the JMLR homepage. This is the field 'abs' under http://jmlr.csail.mit.edu/mloss/ . He will then add this cross-reference and your project benefits from the increased visibility.
I hope that makes the process more transparent and reduces the backlog - ohh and btw we are counting 29 JMLR-MLOSS submissions - keep it going :-)
Improving mloss.org
by Soeren Sonnenburg on December 16, 2011 (0 comments)
When I met Cheng at NIPS, we had a discussion on how to improve the user experience of mloss.org. So I got my hands dirty and fixed a few minor issues on mloss.org:
A long-standing feature request was that software that once appeared in JMLR would continue to be tagged 'published in JMLR' and be highlighted. This should now be the case.
In addition, I again limited the automagic pulling of R-CRAN packages so that it now happens only once a month. This should give manually updated software a higher visibility again.
If you have suggestions for improvements and don't mind coding a little Python, mloss.org's source code is now available on GitHub. In particular, if you'd like to attempt to improve the R-CRAN slurper, its code is here.
Mendeley/PLoS API Binary Battle (winners)
by Cheng Soon Ong on December 6, 2011 (0 comments)
The results of the Mendeley/PLoS API Binary Battle are out:
Winner
Share your personal genome from 23andMe or deCODEme to find the latest relevant research and let scientists discover new genetic associations.
1st runner up
Continual reviews of papers, even after they are published.
2nd runner up
Something close to my heart, programmatic interface to data!
What is a file?
by Cheng Soon Ong on December 5, 2011 (0 comments)
Two bits of news appeared recently:
The distribution of file sizes on the internet indicates that the human brain limits the amount of produced data. The article however observes that "it'll be interesting to see how machine intelligence might change this equation. It may be that machines can be designed to distort our relationship with information. If so, then a careful measure of file size distribution could reveal the first signs that intelligent machines are among us!"
Paul Allen's Institute has been publishing its data in an open fashion. Ironically, the article is behind a paywall. However, the Allen Institute for Brain Science has a data portal.
I wondered about the distribution of data which is clearly machine generated and in some sense most easily digested by machine as well. It turns out that it is quite difficult to find out how big files are. In some sense, for the brain atlas, the amount of data (>1 petabyte of image data alone) is more than is easily transferable across the internet. Most human users of this data would use some sort of web-based visualization of the data, and hence the meaning of the word "file" isn't so obvious. In fact, there has been a recent trend to "hide" the concept of a file. One example is iPhones and iPads, where you do not have access to the file system, and hence do not really know whether you are transferring parts of a file or streaming bytes. Another example is Google's AppEngine, where users access data through a database. A third example is Amazon's Silk browser, which "renders" a web page in a more efficient fashion using Amazon's infrastructure rather than your local client.
If we take the extreme view that we use some sort of machine learning algorithm to filter the world's data for our consumption, this implies that all the world's data is in one "file", and we are just looking at parts of it. From this point of view, the paper about using file sizes to reveal machine intelligence is not going to work. In fact, thinking about file sizes in the first place is just plain misleading.
Linus's Lessons on Software
by Cheng Soon Ong on September 27, 2011 (1 comment)
Linus Torvalds talks about how to run a successful software project.
Two things people commonly get wrong:
“The first thing is thinking that you can throw things out there and ask people to help,”
“The other thing—and it's kind of related—that people seem to get wrong is to think that the code they write is what matters,”
The main points on how to run a successful project:
- It is not about the code, it is about the user
- A good workflow for the project is important, and tools may help to create a good workflow.
- For big projects, development happens in small core groups
- Let go, and don't try to control the people and the code
Have a look at the full article.
Software Freedom Day
by Cheng Soon Ong on September 17, 2011 (0 comments)
Happy Software Freedom Day!