<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:wfw="http://wellformedweb.org/CommentAPI/"><channel><title>The mloss.org community blog</title><link>http://mloss.org/community</link><description>Some thoughts about machine learning open source software</description><language>en</language><lastBuildDate>Tue, 24 Apr 2012 11:24:18 -0000</lastBuildDate><item><title>Google Summer of Code 2012</title><link>http://mloss.org/community/blog/2012/apr/24/google-summer-of-code-2012/</link><description>&lt;p&gt;The list of Google Summer of Code (GSoC) students for 2012 has been &lt;a href="http://google-opensource.blogspot.com.au/2012/04/students-announced-for-google-summer-of.html"&gt;announced&lt;/a&gt;.
&lt;/p&gt;
&lt;p&gt;For young programmers, it is probably the easiest way to get a foot in the door by showing that you can contribute to something worthwhile. For open source projects, it is an injection of fresh blood. For academics looking for programmer types, it is a good way to differentiate between applicants with top marks from universities that you do not know personally.
&lt;/p&gt;
&lt;p&gt;Among the mentoring organisations which may be of interest to the machine learning community:
&lt;/p&gt;
&lt;ul&gt;
 &lt;li&gt;
     &lt;a href="http://www.wesnoth.org/"&gt;Battle for Wesnoth&lt;/a&gt; with 5 students
 &lt;/li&gt;

 &lt;li&gt;
     &lt;a href="http://www.cgal.org/"&gt;cgal&lt;/a&gt; with 4 students
 &lt;/li&gt;

 &lt;li&gt;
     &lt;a href="http://cmusphinx.sourceforge.net/"&gt;CMU Sphinx&lt;/a&gt; with 6 students
 &lt;/li&gt;

 &lt;li&gt;
     &lt;a href="http://dbpedia.org/spotlight"&gt;DBpedia spotlight&lt;/a&gt; with 4 students
 &lt;/li&gt;

 &lt;li&gt;
     &lt;a href="http://www.gmod.org/wiki/Main_Page"&gt;Genome Informatics&lt;/a&gt; with 2 students
 &lt;/li&gt;

 &lt;li&gt;
     &lt;a href="https://gephi.org/"&gt;Gephi consortium&lt;/a&gt; with 5 students
 &lt;/li&gt;

 &lt;li&gt;
     &lt;a href="http://www.hedgewars.org/"&gt;Hedgewars project&lt;/a&gt; with 5 students
 &lt;/li&gt;

 &lt;li&gt;
     &lt;a href="http://nrnb.org/gsoc/"&gt;National Resource for Network Biology&lt;/a&gt; with 16 students
 &lt;/li&gt;

 &lt;li&gt;
     &lt;a href="http://www.open-bio.org/wiki/Main_Page"&gt;Open Bioinformatics Foundation&lt;/a&gt; with 5 students
 &lt;/li&gt;

 &lt;li&gt;
     &lt;a href="http://code.opencv.org/projects/OpenCV/wiki/WikiStart"&gt;Open CV&lt;/a&gt; with 12 students
 &lt;/li&gt;

 &lt;li&gt;
     &lt;a href="http://opencog.org/"&gt;OpenCog foundation&lt;/a&gt; with 5 students
 &lt;/li&gt;

 &lt;li&gt;
     &lt;a href="http://orange.biolab.si/"&gt;Orange&lt;/a&gt; with 5 students
 &lt;/li&gt;

 &lt;li&gt;
     &lt;a href="http://www.shogun-toolbox.org/"&gt;shogun&lt;/a&gt; with 8 students, and I'm mentoring here.
 &lt;/li&gt;

 &lt;li&gt;
     &lt;a href="http://simplecv.org/"&gt;SimpleCV&lt;/a&gt; with 4 students
 &lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A warm welcome to everyone!
&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Cheng Soon Ong</dc:creator><pubDate>Tue, 24 Apr 2012 11:24:18 -0000</pubDate><guid>http://mloss.org/community/blog/2012/apr/24/google-summer-of-code-2012/</guid></item><item><title>Open Access is very cheap</title><link>http://mloss.org/community/blog/2012/mar/08/open-access-is-very-cheap/</link><description>&lt;p&gt;Stuart Shieber just posted very convincing evidence that publication does not really cost that much, at least in technically savvy fields. Read about it here...
&lt;/p&gt;
&lt;p&gt;&lt;a href="http://blogs.law.harvard.edu/pamphlet/2012/03/06/an-efficient-journal/"&gt;An efficient journal&lt;/a&gt;
&lt;/p&gt;
&lt;p&gt;"Adding it all up, a reasonable imputed estimate for JMLR’s total direct costs other than the volunteered labor (that is, tax accountant, web hosting, domain names, clerical work, etc.) is less than $10,000, covering the almost 1,000 articles the journal has published since its founding — about $10 per article. With regard to whose understanding of JMLR’s financing is better than whose, Yann LeCun I think comes out on top.
   How do I know all this about JMLR? Because (full disclosure alert) I am [ed. the publisher]."
&lt;/p&gt;
&lt;p&gt;This shows that Yann LeCun knew what he was talking about in the argument with Kent Anderson in the &lt;a href="http://scholarlykitchen.sspnet.org/2011/09/01/uninformed-unhinged-and-unfair-the-monbiot-rant/#comments"&gt;comments section of the Scholarly Kitchen&lt;/a&gt;.
&lt;/p&gt;
&lt;p&gt;mloss.org does not cost that much either. It was initially hosted by the Friedrich Miescher Laboratory in Tuebingen, and is now hosted by the Technical University Berlin. All coding was done by Soeren, Mikio and me during our free time. mldata.org cost a bit more because we paid a programmer and an intern for a few months at the start, and we also bought a more serious server for the heavier load. Luckily we have a &lt;a href="http://pascallin2.ecs.soton.ac.uk/"&gt;PASCAL2&lt;/a&gt; grant.
&lt;/p&gt;
&lt;p&gt;The real cost, as with JMLR, is the volunteer time needed. In fact, the mloss/mldata team is stretched pretty thin at the moment, and any help would be most welcome. Please contact us if you have a few free hours!
&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Cheng Soon Ong</dc:creator><pubDate>Thu, 08 Mar 2012 10:58:21 -0000</pubDate><guid>http://mloss.org/community/blog/2012/mar/08/open-access-is-very-cheap/</guid></item><item><title>Did the MathWorks Infringe the Competition Laws?</title><link>http://mloss.org/community/blog/2012/mar/02/did-the-mathworks-infringe-the-competition-laws/</link><description>&lt;p&gt;I have just &lt;a href="http://heise.de/-1446391"&gt;read&lt;/a&gt; that the EU commission is investigating whether The MathWorks
   infringed EU competition law in connection with its software products Matlab and Simulink. An unnamed competitor complained to the EU commission that The MathWorks refused to license Matlab/Simulink to it. Without a license, the competitor cannot make its product interoperable and cannot perform (rightful!) reverse engineering.
&lt;/p&gt;
&lt;p&gt;The original source is here
   &lt;a href="http://europa.eu/rapid/pressReleasesAction.do?reference=IP/12/208&amp;amp;format=HTML&amp;amp;aged=0&amp;amp;language=EN&amp;amp;guiLanguage=en"&gt;http://europa.eu/rapid/pressReleasesAction.do?reference=IP/12/208&amp;amp;format=HTML&amp;amp;aged=0&amp;amp;language=EN&amp;amp;guiLanguage=en&lt;/a&gt;
&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Soeren Sonnenburg</dc:creator><pubDate>Fri, 02 Mar 2012 09:34:01 -0000</pubDate><guid>http://mloss.org/community/blog/2012/mar/02/did-the-mathworks-infringe-the-competition-laws/</guid></item><item><title>Nature Editorial about Open Science</title><link>http://mloss.org/community/blog/2012/feb/28/nature-editorial-about-open-science/</link><description>&lt;p&gt;&lt;em&gt;The case for open computer programs&lt;/em&gt;
&lt;/p&gt;
&lt;p&gt;Does open source software imply reproducible research?
&lt;/p&gt;
&lt;p&gt;There was a recent &lt;a href="http://dx.doi.org/10.1038/nature10836"&gt;Nature editorial&lt;/a&gt; expounding the need for open source software in scientific endeavors. It argues that many modern scientific results depend on complex computations and hence source code is needed for scientific reproducibility. It is nice that a high profile journal has published articles promoting open source software, since it &lt;a href="http://arstechnica.com/science/news/2012/02/science-code-should-be-open-source-according-to-editorial.ars"&gt;increases visibility&lt;/a&gt;. However, some more careful thought is required, as the message of the article is inaccurate in both directions.
&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Open source provides more benefits than just reproducibility&lt;/em&gt;
&lt;/p&gt;
&lt;p&gt;Actually, open source provides more than is necessary for reproducibility, since the &lt;a href="http://www.opensource.org/licenses/alphabetical"&gt;licenses&lt;/a&gt; provide the ability to edit and extend the code, and prevent discriminatory practices. To be pedantic, for reproducibility, any software (even a compiled executable) would work.
&lt;/p&gt;
&lt;p&gt;We've said this &lt;a href="http://jmlr.csail.mit.edu/papers/v8/sonnenburg07a.html"&gt;before&lt;/a&gt; but the message is worth repeating. Open source provides:
&lt;/p&gt;
&lt;ol&gt;
 &lt;li&gt;
     reproducibility of scientific results and fair comparison of algorithms;
 &lt;/li&gt;

 &lt;li&gt;
     uncovering problems;
 &lt;/li&gt;

 &lt;li&gt;
     building on existing resources (rather than re-implementing them);
 &lt;/li&gt;

 &lt;li&gt;
     access to scientific tools without cease;
 &lt;/li&gt;

 &lt;li&gt;
     combination of advances;
 &lt;/li&gt;

 &lt;li&gt;
     faster adoption of methods in different disciplines and in industry; and
 &lt;/li&gt;

 &lt;li&gt;
     collaborative emergence of standards.
 &lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;See our &lt;a href="http://jmlr.csail.mit.edu/papers/v8/sonnenburg07a.html"&gt;paper&lt;/a&gt; for more details.
&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Having source code does not imply reproducibility&lt;/em&gt;
&lt;/p&gt;
&lt;p&gt;As the editorial observes in the final sentence of the abstract, "The vagaries of hardware, software and natural language will always ensure that exact reproducibility remains uncertain, ... ". I've personally spent many frustrating hours trying to get somebody's research code compiled. In fact, one of the most common complaints by reviewers of &lt;a href="http://jmlr.csail.mit.edu/mloss/mloss-info.html"&gt;JMLR's open source track&lt;/a&gt; is that they are unable to get submissions to work on their computer. The multitude of computing environments, numerical libraries and programming languages means that very often, the user of the software is in a different frame of mind compared to the authors of the source code. My advice to fledgling authors of machine learning open source software is to provide a "quickstart" tutorial in the README, because everybody is impatient, and nobody will look into fixing your bugs before they are convinced that your code will do something useful for them. And yes, fixing $PATH can be tricky if you don't know exactly how to do it.
&lt;/p&gt;
&lt;p&gt;I guess the bottom line is quite an obvious statement: &lt;em&gt;Good&lt;/em&gt; open source software will give you reproducibility and a few other additional benefits.
&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Cheng Soon Ong</dc:creator><pubDate>Tue, 28 Feb 2012 10:34:51 -0000</pubDate><guid>http://mloss.org/community/blog/2012/feb/28/nature-editorial-about-open-science/</guid></item><item><title>Tagging Project 'Published in JMLR'</title><link>http://mloss.org/community/blog/2012/feb/03/tagging-project-published-in-jmlr/</link><description>&lt;p&gt;I did some minor maintenance work today updating email addresses of certain projects and merging libmlpack and MLPACK.
&lt;/p&gt;
&lt;p&gt;More importantly, I added the JMLR MLOSS URL to all projects listed on mloss.org that have a corresponding JMLR publication. Since there seemed to be quite a backlog, it might be worthwhile to shed some light on how a project gets flagged 'published in jmlr'.
&lt;/p&gt;
&lt;p&gt;Naturally, the precondition is a corresponding JMLR MLOSS paper that has been accepted and is already online. In addition, you should remind your JMLR action editor to flag the project 'published in jmlr' by giving them the link to the abstract of your publication on the JMLR homepage. This is the field 'abs' under &lt;a href="http://jmlr.csail.mit.edu/mloss/"&gt;http://jmlr.csail.mit.edu/mloss/&lt;/a&gt;. The editor will then add this cross-reference, and your project benefits from the increased visibility.
&lt;/p&gt;
&lt;p&gt;I hope that makes the process more transparent and reduces the backlog - ohh and btw we are counting &lt;a href="http://mloss.org/software/jmlr/"&gt;29 JMLR-MLOSS submissions&lt;/a&gt; - keep it going :-)
&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Soeren Sonnenburg</dc:creator><pubDate>Fri, 03 Feb 2012 11:19:24 -0000</pubDate><guid>http://mloss.org/community/blog/2012/feb/03/tagging-project-published-in-jmlr/</guid></item><item><title>Improving mloss.org</title><link>http://mloss.org/community/blog/2011/dec/16/improving-mlossorg/</link><description>&lt;p&gt;Meeting Cheng at NIPS we had a discussion on how to improve user experience of mloss.org. So I got my hands dirty and fixed a few minor issues on mloss.org:
&lt;/p&gt;
&lt;p&gt;A long-standing feature request was that software that once appeared in JMLR should continue to be tagged 'published in JMLR' and be highlighted. This should now be the case.
&lt;/p&gt;
&lt;p&gt;In addition, I again limited the automagic pulling of R-CRAN packages so that it now happens only once a month. This should give manually updated software higher visibility again.
&lt;/p&gt;
&lt;p&gt;If you have suggestions for improvements and don't mind coding a little Python, mloss.org's source code is now available on &lt;a href="https://github.com/open-machine-learning"&gt;github&lt;/a&gt;. In particular, if you'd like to attempt to improve the R-CRAN slurper, its code is &lt;a href="https://github.com/open-machine-learning/mloss/blob/master/cran/update_cran.py"&gt;here&lt;/a&gt;.
&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Soeren Sonnenburg</dc:creator><pubDate>Fri, 16 Dec 2011 05:04:29 -0000</pubDate><guid>http://mloss.org/community/blog/2011/dec/16/improving-mlossorg/</guid></item><item><title>Mendeley/PLoS API Binary Battle (winners)</title><link>http://mloss.org/community/blog/2011/dec/06/mendeleyplos-api-binary-battle-winners/</link><description>&lt;p&gt;The &lt;a href="http://www.mendeley.com/blog/design-research-tools/winners-of-the-first-binary-battle-apps-for-science-contest/"&gt;results&lt;/a&gt; of the 
   &lt;a href="https://mloss.org/community/blog/2011/aug/29/mendeleyplos-api-binary-battle/"&gt;Mendeley/PLoS API Binary Battle&lt;/a&gt; are out:
&lt;/p&gt;

&lt;h2&gt;Winner&lt;/h2&gt;
&lt;p&gt;&lt;a href="http://opensnp.org"&gt;openSNP&lt;/a&gt;
&lt;/p&gt;
&lt;p&gt;Share your personal genome from 23andMe or deCODEme to find the latest relevant research and let scientists discover new genetic associations.
&lt;/p&gt;

&lt;h2&gt;1st runner up&lt;/h2&gt;
&lt;p&gt;&lt;a href="http://www.papercritic.com/"&gt;PaperCritic&lt;/a&gt;
&lt;/p&gt;
&lt;p&gt;Continual reviews of papers, even after they are published.
&lt;/p&gt;

&lt;h2&gt;2nd runner up&lt;/h2&gt;
&lt;p&gt;&lt;a href="http://ropensci.org/"&gt;rOpenSci&lt;/a&gt;
&lt;/p&gt;
&lt;p&gt;Something close to my heart, programmatic interface to data!
&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Cheng Soon Ong</dc:creator><pubDate>Tue, 06 Dec 2011 11:33:01 -0000</pubDate><guid>http://mloss.org/community/blog/2011/dec/06/mendeleyplos-api-binary-battle-winners/</guid></item><item><title>What is a file?</title><link>http://mloss.org/community/blog/2011/dec/05/what-is-a-file/</link><description>&lt;p&gt;What is a file?
&lt;/p&gt;
&lt;p&gt;Two bits of news appeared recently:
&lt;/p&gt;
&lt;ul&gt;
 &lt;li&gt;&lt;p&gt;The distribution of file sizes on the internet indicates that &lt;a href="http://www.technologyreview.com/blog/arxiv/27379/"&gt;the human brain limits the amount of produced data&lt;/a&gt;. The article however observes that "it'll be interesting to see how machine intelligence might change this equation. It may be that machines can be designed to distort our relationship with information. If so, then a careful measure of file size distribution could reveal the first signs that intelligent machines are among us!"
&lt;/p&gt;

 &lt;/li&gt;

 &lt;li&gt;&lt;p&gt;Paul Allen's Institute has been &lt;a href="http://online.wsj.com/article/SB10001424052970204630904577058162033028028.html"&gt;publishing its data&lt;/a&gt; in an open fashion. Ironically, the article is behind a paywall. However, the &lt;a href="http://www.alleninstitute.org/"&gt;Allen Institute for Brain Science&lt;/a&gt; has a &lt;a href="http://www.brain-map.org/"&gt;data portal&lt;/a&gt;.
&lt;/p&gt;

 &lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I wondered about the distribution of data which is clearly machine generated and in some sense most easily digested by machine as well. It turns out that it is quite difficult to find out how big files are. In some sense, for the brain atlas, the amount of data (&amp;gt;1 petabyte of image data alone) is more than is easily transferable across the internet. Most human users of this data would use some sort of web based visualization of the data, and hence the meaning of the word "file" isn't so obvious. In fact, there has been a recent trend to "hide" the concept of a file. One example is iPhones and iPads where you do not have access to the &lt;a href="http://dottech.org/tipsntricks/18890/four-ways-to-access-your-idevices-iphoneipadipod-touch-file-system-from-your-computer/"&gt;file system&lt;/a&gt;, and hence do not really know whether you are transferring parts of a file or streaming bytes. Another example is Google's AppEngine, where users access data through a &lt;a href="http://code.google.com/appengine/docs/java/datastore/"&gt;database&lt;/a&gt;. A third example is Amazon's Silk browser which &lt;a href="http://gizmodo.com/5844663/what-is-amazon-silk/gallery/1"&gt;"renders"&lt;/a&gt; a web page in a more efficient fashion using Amazon's infrastructure rather than your local client.
&lt;/p&gt;
&lt;p&gt;If we take the extreme view that we use some sort of machine learning algorithm to filter the world's data for our consumption, this implies that all the world's data is in one "file", and we are just looking at parts of it. From this point of view, the paper about using file sizes to reveal machine intelligence is not going to work. In fact, thinking about file sizes in the first place is just plain misleading.
&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Cheng Soon Ong</dc:creator><pubDate>Mon, 05 Dec 2011 17:00:21 -0000</pubDate><guid>http://mloss.org/community/blog/2011/dec/05/what-is-a-file/</guid></item><item><title>Linus's Lessons on Software</title><link>http://mloss.org/community/blog/2011/sep/27/linuss-lessons-on-software/</link><description>&lt;p&gt;Linus Torvalds talks about &lt;a href="http://h30565.www3.hp.com/t5/Feature-Articles/Linus-Torvalds-s-Lessons-on-Software-Development-Management/ba-p/440"&gt;how to run a successful software project&lt;/a&gt;. 
&lt;/p&gt;
&lt;p&gt;Two things people commonly get wrong:
&lt;/p&gt;
&lt;p&gt;“The first thing is thinking that you can throw things out there and ask people to help,”
&lt;/p&gt;
&lt;p&gt;“The other thing—and it's kind of related—that people seem to get wrong is to think that the code they write is what matters,”
&lt;/p&gt;
&lt;p&gt;The main points on how to run a successful project:
&lt;/p&gt;
&lt;ul&gt;
 &lt;li&gt;
     It is not about the code, it is about the user
 &lt;/li&gt;

 &lt;li&gt;
     A good workflow for the project is important, and tools may help to create a good workflow.
 &lt;/li&gt;

 &lt;li&gt;
     For big projects, development happens in small core groups
 &lt;/li&gt;

 &lt;li&gt;
     Let go, and don't try to control the people and the code
 &lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Have a look at the &lt;a href="http://h30565.www3.hp.com/t5/Feature-Articles/Linus-Torvalds-s-Lessons-on-Software-Development-Management/ba-p/440"&gt;full article&lt;/a&gt;.
&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Cheng Soon Ong</dc:creator><pubDate>Tue, 27 Sep 2011 16:43:21 -0000</pubDate><guid>http://mloss.org/community/blog/2011/sep/27/linuss-lessons-on-software/</guid></item><item><title>Software Freedom Day</title><link>http://mloss.org/community/blog/2011/sep/17/software-freedom-day/</link><description>&lt;p&gt;Happy software freedom day!
&lt;/p&gt;
&lt;p&gt;&lt;a href="http://www.softwarefreedomday.org/"&gt;http://www.softwarefreedomday.org/&lt;/a&gt;
&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Cheng Soon Ong</dc:creator><pubDate>Sat, 17 Sep 2011 12:52:08 -0000</pubDate><guid>http://mloss.org/community/blog/2011/sep/17/software-freedom-day/</guid></item></channel></rss>
