<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:wfw="http://wellformedweb.org/CommentAPI/"><channel><title>The mloss.org community blog</title><link>http://mloss.org/community</link><description>Some thoughts about machine learning open source software</description><language>en</language><lastBuildDate>Mon, 22 Mar 2010 07:23:05 -0000</lastBuildDate><item><title>Citing Wikipedia</title><link>http://mloss.org/community/blog/2010/mar/22/citing-wikipedia/</link><description>&lt;p&gt;I just stumbled across this &lt;a href="http://etbe.coker.com.au/2010/03/21/citing-wikipedia/"&gt;blog entry&lt;/a&gt; which I found interesting to read.
&lt;/p&gt;
&lt;p&gt;Quoting the first paragraphs from the source above:
&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;Now it’s well known and generally agreed that you can’t cite Wikipedia for a scientific paper or other serious academic work. This makes sense firstly because Wikipedia changes, both in the short term (including vandalism) and in the long term (due to changes in technology, new archaeological discoveries, current events, etc). But you can link to a particular version of a Wikipedia page, you can just click on the history tab at the top of the screen and then click on the date of the version for which you want a direct permanent link.
&lt;/p&gt;
&lt;p&gt;The real reason for not linking to Wikipedia articles in academic publications is that you want to reference the original research not a report on it, which really makes sense. Of course the down-side is that you might reference some data that is in the middle of a 100 page report, in which case you might have to mention the page number as well. Also often the summary of the data you desire simply isn’t available anywhere else, someone might for example take some facts from 10 different pages of a government document and summarise them neatly in a single paragraph on Wikipedia. This isn’t a huge obstacle but just takes more time to create your own summary with references.
&lt;/p&gt;
&lt;/blockquote&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Soeren Sonnenburg</dc:creator><pubDate>Mon, 22 Mar 2010 07:23:05 -0000</pubDate><guid>http://mloss.org/community/blog/2010/mar/22/citing-wikipedia/</guid></item><item><title>Nat Torkington on Open Data</title><link>http://mloss.org/community/blog/2010/mar/09/nat-torkington-on-open-data/</link><description>&lt;p&gt;I recently came across a blog on O'Reilly Radar about &lt;a href="http://radar.oreilly.com/2010/03/truly-open-data.html"&gt;Truly Open Data&lt;/a&gt;, which talks about how concepts from open source software can be translated to open data. Basically, apart from just "getting the data out there", we need software tools for managing this data. I summarize his list of tools below, with some thoughts on how this may apply to machine learning data.
&lt;/p&gt;
&lt;ul&gt;
 &lt;li&gt;
     diff and patch - Perhaps we need some md5sum for binary data? It seems that most machine learners actually don't use "live" data very often, so perhaps these resources are not needed for us?
 &lt;/li&gt;

 &lt;li&gt;
     version control
 &lt;/li&gt;

 &lt;li&gt;
     releases - An obvious release point would be upon submission of a paper. One downside I realized about double blind &lt;a href="http://hunch.net/?p=1086"&gt;reviewing&lt;/a&gt; is that one cannot release new data (or software) upon submission. Some things are just easier to do with some real bits.
 &lt;/li&gt;

 &lt;li&gt;
     documentation - Apart from bioinformatics data that I generated myself, I'd be hard pressed to name one dataset (apart from &lt;a href="http://en.wikipedia.org/wiki/Iris_flower_data_set"&gt;iris&lt;/a&gt;) where I know the provenance of the data.
 &lt;/li&gt;
&lt;/ul&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Cheng Soon Ong</dc:creator><pubDate>Tue, 09 Mar 2010 17:48:03 -0000</pubDate><guid>http://mloss.org/community/blog/2010/mar/09/nat-torkington-on-open-data/</guid></item><item><title>Daniel Lemire on Open Source Software</title><link>http://mloss.org/community/blog/2010/feb/16/daniel-lemire-on-open-source-software/</link><description>&lt;p&gt;Daniel Lemire has an interesting blog post on &lt;a href="http://www.daniel-lemire.com/blog/archives/2010/02/10/open-sourcing-your-software-hurts-your-competitiveness-as-a-researcher/"&gt;whether open sourcing your software affects your competitiveness as a researcher&lt;/a&gt;.
&lt;/p&gt;
&lt;p&gt;In short, here is his summary:
&lt;/p&gt;
&lt;ol&gt;
 &lt;li&gt;
     Sharing can’t hurt the small fish.
 &lt;/li&gt;

 &lt;li&gt;
     Sharing your code makes you more convincing.
 &lt;/li&gt;

 &lt;li&gt;
     Source code helps spread your ideas faster.
 &lt;/li&gt;

 &lt;li&gt;
     Sharing raises your profile in industry.
 &lt;/li&gt;

 &lt;li&gt;
     You write better software if you share it.
 &lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This is very much in line with why we started the whole initiative in the first place.
&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Mikio L. Braun</dc:creator><pubDate>Tue, 16 Feb 2010 10:38:25 -0000</pubDate><guid>http://mloss.org/community/blog/2010/feb/16/daniel-lemire-on-open-source-software/</guid></item><item><title>MLOSS 2010 - ICML Workshop just accepted</title><link>http://mloss.org/community/blog/2010/feb/12/mloss-2010-icml-workshop-just-accepted/</link><description>&lt;p&gt;We are glad to announce that our &lt;a href="http://mloss.org/workshop/icml10/"&gt;MLOSS 2010 workshop&lt;/a&gt; at this year's ICML conference has been accepted!
&lt;/p&gt;
&lt;p&gt;We are therefore happily accepting software submissions. The submission deadline is April 10th, 2010. If accepted, you can present your software to the workshop audience, which is a great opportunity to make your software better known in the machine learning community.
&lt;/p&gt;
&lt;p&gt;Like last time, we will use mloss.org for managing the submissions. You basically just have to register your project with mloss.org and add the tag &lt;code&gt;icml2010&lt;/code&gt; to it. For more information, have a look at the &lt;a href="http://mloss.org/workshop/icml10/"&gt;workshop page&lt;/a&gt;.
&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Soeren Sonnenburg</dc:creator><pubDate>Fri, 12 Feb 2010 19:17:49 -0000</pubDate><guid>http://mloss.org/community/blog/2010/feb/12/mloss-2010-icml-workshop-just-accepted/</guid></item><item><title>Missing values</title><link>http://mloss.org/community/blog/2010/feb/02/missing-values/</link><description>&lt;p&gt;We were recently working on a way to represent data efficiently, and came across the problem of missing values. For simple tabular formats in which all entries have the same type (e.g. all real values), it is convenient to store data as a 2-D array. We are thinking of a Python numpy array, but I'm sure any solution should be language independent. However, very often, datasets contain missing values, which are indicated by some special character, for example by '?' in &lt;a href="http://www.cs.waikato.ac.nz/~ml/weka/arff.html"&gt;weka's arff format&lt;/a&gt;. Unfortunately, the character '?' is not a real number, hence stuffing up the array.
&lt;/p&gt;
&lt;p&gt;Does anyone have a suggestion on how to deal with this?
&lt;/p&gt;
&lt;p&gt;Note that I'm not talking about something like missing value imputation, but just the question of how to represent simple tabular data in computer memory. Of course, the question can be generalized such that some features may have different types from others.
&lt;/p&gt;
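&lt;p&gt;To make the question concrete, here is a minimal sketch of two options that numpy itself offers for all-real data: storing NaN in place of the missing entry, or wrapping the array in a masked array. The names raw_rows and to_float and the example values are made up for illustration, and this is only one possible approach, not a recommendation.
&lt;/p&gt;

```python
import numpy as np

# Hypothetical rows as they might be read from an ARFF-style file,
# where '?' marks a missing entry.
raw_rows = [
    ["5.1", "3.5", "?"],
    ["4.9", "?", "1.4"],
    ["6.2", "2.9", "4.3"],
]


def to_float(token):
    """Map the '?' marker to NaN and every other token to a float."""
    return np.nan if token == "?" else float(token)


# Option 1: a plain float array in which NaN stands for "missing".
data = np.array([[to_float(t) for t in row] for row in raw_rows])

# Option 2: a masked array that carries an explicit missing-value mask.
masked = np.ma.masked_invalid(data)

print(np.isnan(data).sum())   # number of missing entries
print(masked.mean(axis=0))    # column means computed over unmasked entries
```

&lt;p&gt;NaN only exists for floating point types, so the first option does not carry over to integer or categorical columns. A masked array keeps the mask separate from the values and therefore also works for other types.
&lt;/p&gt;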
&lt;p&gt;This seems like such a common problem that there must be hundreds of solutions out there...
&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Cheng Soon Ong</dc:creator><pubDate>Tue, 02 Feb 2010 18:52:20 -0000</pubDate><guid>http://mloss.org/community/blog/2010/feb/02/missing-values/</guid></item><item><title>Data and Code Sharing Roundtable</title><link>http://mloss.org/community/blog/2010/jan/26/data-and-code-sharing-roundtable/</link><description>&lt;p&gt;As pointed out by the authors of the mloss position paper [1] in 2007, "reproducibility of experimental results is a cornerstone of science." Just as in machine learning, researchers in many computational fields (or fields in which computation has only recently played a major role) are struggling to reconcile our expectation of reproducibility in science with the reality of ever-growing computational complexity and opacity. [2-12]
&lt;/p&gt;
&lt;p&gt;In an effort to address these questions from researchers not only from statistical science but from a variety of disciplines, and to discuss possible solutions with representatives from publishing, funding, and legal scholars expert in appropriate licensing for open access, &lt;a href="http://www.law.yale.edu/intellectuallife/informationsocietyproject.htm"&gt;Yale Information Society Project&lt;/a&gt; Fellow &lt;a href="http://www.stanford.edu/~vcs/"&gt;Victoria Stodden&lt;/a&gt; convened a &lt;a href="http://www.stanford.edu/~vcs/Conferences/RoundtableNov212009"&gt;roundtable&lt;/a&gt; on the topic on November 21, 2009. Attendees included statistical scientists such as &lt;a href="http://gentleman.fhcrc.org/"&gt;Robert Gentleman&lt;/a&gt; (co-developer of R) and &lt;a href="http://www-stat.stanford.edu/~donoho"&gt;David Donoho&lt;/a&gt;, among others.
&lt;/p&gt;
&lt;p&gt;The inspiration for this roundtable was the leadership of the genome research community in establishing the open release of sequence data. Representatives from that community gathered in Bermuda in 1996 to develop a cooperative strategy both for genome decoding and for managing and sharing the resulting data. Their meeting resulted in the "Bermuda Principles" [13] that shaped the ensuing data sharing practices among researchers and ensured rapid data release. In the computational research community more generally the incentives and pressures can differ from those in human genome sequencing; consequently, the roundtable sought to consider the issues in a larger context. A second goal of the workshop was to produce a publishable document discussing reactions to data and code sharing in computational science. We also published short topical thought pieces [14] authored by participants, including by statistical scientists [15-16], raising awareness of the issue of reproducibility in computational science.
&lt;/p&gt;
&lt;p&gt;The Data and Code Sharing Roundtable adapted the focus of the genomics community to include access to source code as well as data, across the computational sciences. This echoes mloss's call for "the supporting software and data" to be openly distributed through the mloss repository with links to alternatively stored data collections. The Yale roundtable was organized in five parts: framing issues, examining legal barriers and solutions, considering the role of scientific norms and incentives, discussing how technological tools help and hinder sharing, and finally crafting key points for release in a statement. The agenda is available &lt;a href="http://j.mp/89lDC9"&gt;online&lt;/a&gt; [17] with links to each session's slide decks.
&lt;/p&gt;
&lt;p&gt;The first session framed issues across the disparate fields and was moderated by Harvard Astronomy &lt;a href="http://www.cfa.harvard.edu/~agoodman/"&gt;Professor Alyssa Goodman&lt;/a&gt;, and featured presentations from &lt;a href="http://bioinfo.mbb.yale.edu/"&gt;Mark Gerstein&lt;/a&gt;, the Albert L. Williams Professor of Computational Biology and Bioinformatics at Yale, &lt;a href="http://www.amath.washington.edu/~rjl/"&gt;Randy LeVeque&lt;/a&gt;, the Founders Term Professor of Applied Mathematics at the University of Washington, and Alyssa Goodman herself. The second session was moderated by &lt;a href="http://law.shu.edu/Faculty/display-profile.cfm?customel_datapageid_4018=22642"&gt;Frank Pasquale&lt;/a&gt;, the Loftus Professor of Law at Seton Hall University, and discussed legal barriers to the sharing of research codes and data and presented alternate licensing frameworks to enable sharing. &lt;a href="http://cmgm.stanford.edu/pbrown/Pat_Brown_Lab_Home_Page/Home.html"&gt;Pat Brown&lt;/a&gt;, Professor of Biochemistry at Stanford University, moderated the session on norms and incentives, leading a discussion of publishing models, peer review, and reward structures in the scientific community. The session on computational solutions was moderated by &lt;a href="http://people.cs.ubc.ca/~mitchell/"&gt;Ian Mitchell&lt;/a&gt;, Computer Science Professor at the University of British Columbia, and examined computational solutions (see for example &lt;a href="http://www.mcs.anl.gov/about/people_detail.php?id=327"&gt;Matt Knepley's&lt;/a&gt; &lt;a href="http://www.stanford.edu/~vcs/Nov21/MattKnepley-Yale09.pdf"&gt;slides&lt;/a&gt; from that session). The final session summarized findings and recommendations to be drafted into a jointly authored published statement. 
The organizers are in the process of creating this "position statement," compiled from the discussions at the workshop and from &lt;a href="http://www.stanford.edu/~vcs/Conferences/RoundtableNov212009/ThoughtPieces.html"&gt;"thought pieces"&lt;/a&gt; contributed by attendees.
&lt;/p&gt;
&lt;p&gt;We invite members of mloss.org to consider contributing such a thought piece, and hope that the open source community within machine learning will find the thought pieces, slides, or position statement useful in promoting distribution of source code as part of the scientific publication process and promoting reproducible computational science more generally.
&lt;/p&gt;
&lt;p&gt;Sincerely,
&lt;/p&gt;
&lt;p&gt;Victoria Stodden&lt;br /&gt;
Yale Law School, New Haven, CT&lt;br /&gt;
Science Commons, Cambridge, MA&lt;br /&gt;

   &lt;a href="http://www.stanford.edu/~vcs"&gt;http://www.stanford.edu/~vcs&lt;/a&gt;&lt;br /&gt;

&lt;/p&gt;
&lt;p&gt;Chris Wiggins&lt;br /&gt;
Department of Applied Physics and Applied Mathematics,&lt;br /&gt;
Columbia University, New York, NY&lt;br /&gt;

   &lt;a href="http://www.columbia.edu/~chw2"&gt;http://www.columbia.edu/~chw2&lt;/a&gt;&lt;br /&gt;

&lt;/p&gt;
&lt;p&gt;References:
&lt;/p&gt;
&lt;ul&gt;
 &lt;li&gt;
     [1] Sonnenburg et al., "The need for open source software in machine learning" Journal of Machine Learning Research, 8:2443-2466, 2007 &lt;a href="http://j.mp/52JaPJ"&gt;http://j.mp/52JaPJ&lt;/a&gt;; 
 &lt;/li&gt;

 &lt;li&gt;
     [2] Social science: Gary King, the Albert J. Weatherhead III University Professor at Harvard University, has documented his efforts in the social sciences at his website &lt;a href="http://j.mp/4FfCqz"&gt;http://j.mp/4FfCqz&lt;/a&gt;. He also runs The Dataverse Network, a repository for social science data and code &lt;a href="http://thedata.org"&gt;http://thedata.org&lt;/a&gt;; 
 &lt;/li&gt;

 &lt;li&gt;
     [3] Geophysics: Stanford Geophysics Professor Jon Claerbout's efforts in Geoscience: &lt;a href="http://j.mp/7ZHNEe"&gt;http://j.mp/7ZHNEe&lt;/a&gt;; 
 &lt;/li&gt;

 &lt;li&gt;
     [4] Geophysics: University of Texas at Austin Geosciences Professor Sergey Fomel's open source package for reproducible research, Madagascar: &lt;a href="http://j.mp/6UipCZ"&gt;http://j.mp/6UipCZ&lt;/a&gt;; 
 &lt;/li&gt;

 &lt;li&gt;
     [5] Signal processing: Signal Processing at Ecole Polytechnique Federale de Lausanne, Reproducible Research Repository; including Vandewalle, Patrick and Kovacevic, Jelena and Vetterli, Martin (2009) "Reproducible Research in Signal Processing - What, why, and how" IEEE Signal Processing Magazine, 26 (3). pp. 37-47 (&lt;a href="http://j.mp/6Rc5H2"&gt;http://j.mp/6Rc5H2&lt;/a&gt;); 
 &lt;/li&gt;

 &lt;li&gt;
     [6] Databases: The database community tested replication in SIGMOD 2009 submissions; cf. I. Manolescu, L. Afanasiev, A. Arion, J. Dittrich, S. Manegold, N. Polyzotis, K. Schnaitter, P. Senellart, S. Zoupanos, D. Shasha, et al. "The Repeatability Experiment of SIGMOD 2008" SIGMOD Record, 37(1):39, 2008 &lt;a href="http://j.mp/7SWNli"&gt;http://j.mp/7SWNli&lt;/a&gt;; 
 &lt;/li&gt;

 &lt;li&gt;
     [7] Databases: R.V. Nehme. "Black Hole in Database Research" &lt;a href="http://j.mp/4QODKd"&gt;http://j.mp/4QODKd&lt;/a&gt;; 
 &lt;/li&gt;

 &lt;li&gt;
     [8] Climate: "Please, show us your code" RealClimate, Rasmus E. Benestad &lt;a href="http://j.mp/8bj0CS"&gt;http://j.mp/8bj0CS&lt;/a&gt;; 
 &lt;/li&gt;

 &lt;li&gt;
     [9] Economics: BD McCullough. "Got replicability? The Journal of Money, Credit and Banking Archive" Econ Journal Watch, 4(3):326-337, 2007 &lt;a href="http://j.mp/6otJMx"&gt;http://j.mp/6otJMx&lt;/a&gt;; 
 &lt;/li&gt;

 &lt;li&gt;
     [10] Linguistics: T. Pedersen. "Empiricism is not a matter of faith" Computational Linguistics, 34(3):465-470, 2008. &lt;a href="http://j.mp/31CwFH"&gt;http://j.mp/31CwFH&lt;/a&gt;; 
 &lt;/li&gt;

 &lt;li&gt;
     [11] Computational Biology: Jill P. Mesirov. "Accessible Reproducible Research" Science 22 January 2010: Vol. 327. no. 5964, pp. 415 - 416 &lt;a href="http://j.mp/54SDTv"&gt;http://j.mp/54SDTv&lt;/a&gt;; 
 &lt;/li&gt;

 &lt;li&gt;
     [12] General sources on reproducibility: &lt;a href="http://www.rrplanet.com/"&gt;http://www.rrplanet.com/&lt;/a&gt; and &lt;a href="http://reproducibleresearch.net/"&gt;http://reproducibleresearch.net/&lt;/a&gt;;
 &lt;/li&gt;

 &lt;li&gt;
     [13] "Bermuda Rules: Community Spirit, With Teeth" Science 16 February 2001: Vol. 291. no. 5507, p. 1192 &lt;a href="http://j.mp/4TP2BV"&gt;http://j.mp/4TP2BV&lt;/a&gt;; 
 &lt;/li&gt;

 &lt;li&gt;
     [14] Thought pieces available via &lt;a href="http://j.mp/4EpcMD"&gt;http://j.mp/4EpcMD&lt;/a&gt;; 
 &lt;/li&gt;

 &lt;li&gt;
     [15] "Reproducible research and genome scale biology: approaches in Bioconductor" Vincent Carey and Robert Gentleman, &lt;a href="http://j.mp/8xlPLR"&gt;http://j.mp/8xlPLR&lt;/a&gt;; 
 &lt;/li&gt;

 &lt;li&gt;
     [16] "View Source" Chris Wiggins &lt;a href="http://j.mp/89lDC9"&gt;http://j.mp/89lDC9&lt;/a&gt;; 
 &lt;/li&gt;

 &lt;li&gt;
     [17] Agenda for roundtable available via &lt;a href="http://j.mp/5MlmUG"&gt;http://j.mp/5MlmUG&lt;/a&gt;.
 &lt;/li&gt;
&lt;/ul&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Victoria Stodden, Chris Wiggins</dc:creator><pubDate>Tue, 26 Jan 2010 14:14:51 -0000</pubDate><guid>http://mloss.org/community/blog/2010/jan/26/data-and-code-sharing-roundtable/</guid></item><item><title>The Open Source Process and Research</title><link>http://mloss.org/community/blog/2010/jan/13/the-open-source-process-and-research/</link><description>&lt;p&gt;(Cross posted on &lt;a href="http://blog.mikiobraun.de/2010/01/open-source-process-and-research.html"&gt;blog.mikiobraun.de&lt;/a&gt;)
&lt;/p&gt;
&lt;p&gt;I think there is more to be learned from the open source software
   development process than just publishing the code from your papers. So
   far, we've mostly focused on making the software side more similar to
   publishing scientific papers, for example, through creating a &lt;a href="http://jmlr.csail.mit.edu/mloss/"&gt;special
open source software track at JMLR&lt;/a&gt;.
&lt;/p&gt;
&lt;p&gt;However, there is more to be learned from the open source software
   development process:
&lt;/p&gt;
&lt;ul&gt;
 &lt;li&gt;
     &lt;a href="http://catb.org/~esr/writings/cathedral-bazaar/cathedral-bazaar/ar01s04.html"&gt;&lt;strong&gt;"Release early, release
  often"&lt;/strong&gt;&lt;/a&gt;
  Open source software is not only about making your software
  available for others to reuse, but it is also about getting in touch
  with potential users as early as possible, as closely as possible.
 &lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Contrast this with the typical publication process in science, where
   months pass between your first idea, the submission of the paper,
   its publication, and the reactions through follow-up and response
   papers.
&lt;/p&gt;
&lt;ul&gt;
 &lt;li&gt;
     &lt;strong&gt;Self-organized collaboration&lt;/strong&gt; One nice thing about open source
  software is that you can often find an already sufficiently good
  solution for some part of your problem. This allows you to focus on
  the part which is really new. If existing solutions look
  sufficiently mature and their projects healthy, you might even end
  up relying on others for part of your project, which is really
  interesting given that you don't know these people and have never
  talked to them. But if the project is healthy, there is a good
  chance that they will do their best to help you out, because they
  want to have users for their own project.
 &lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Again, contrast this with how you usually work in science, where it's
   much more common to collaborate with people from your group or people
   within the same project only. Even if there were someone working on
   something which would be immensely useful for you, you wouldn't know
   till months later when their work is finally published. The effect is
   that there is lots of duplicate work, research results from different
   groups don't usually interact easily, and much potential for
   collaboration and synergy is wasted.
&lt;/p&gt;
&lt;p&gt;While there are certainly reasons why these two areas are different,
   I think there are ways to make research more interactive and open.
   And while probably most people aren't willing to switch to &lt;a href="http://en.wikipedia.org/wiki/Open_Notebook_Science"&gt;open
notebook science&lt;/a&gt;,
   I think there are a few things which you can try out now:
&lt;/p&gt;
&lt;ul&gt;
 &lt;li&gt;&lt;p&gt;Communicate with people through your blog, or by Twitter or Facebook, and let them know what you're working on, even before you have polished and published it. And if you don't feel comfortable disclosing everything, how about some preliminary plots or performance numbers? To see how others are using social networks to communicate about their research, have a look at the &lt;a href="http://www.twibes.com/group/machinelearning"&gt;machine learning twibe&lt;/a&gt;, or my (entirely non-authoritative) &lt;a href="http://twitter.com/mikiobraun/mlpeople"&gt;list of machine learning twitterers&lt;/a&gt;, or &lt;a href="http://twitter.com/mikiobraun/lists/memberships"&gt;lists of machine learning people others have compiled&lt;/a&gt;, or another &lt;a href="http://blog.mikiobraun.de/2009/10/machine-learning-feed-update.html"&gt;list of machine learning related blogs&lt;/a&gt;.
&lt;/p&gt;

 &lt;/li&gt;

 &lt;li&gt;&lt;p&gt;Release your software as early as possible, and make use of available infrastructure like blogs, mailing lists, issue trackers, or wikis. There are almost infinitely many options to go about this, either using some site like &lt;a href="http://github.com"&gt;github&lt;/a&gt;, &lt;a href="http://sourceforge.net/"&gt;sourceforge&lt;/a&gt;, &lt;a href="http://kenai.com"&gt;kenai&lt;/a&gt;, &lt;a href="http://launchpad.net"&gt;launchpad&lt;/a&gt;, &lt;a href="http://savannah.gnu.org/"&gt;savannah&lt;/a&gt;, or by setting up a private repository, for example using &lt;a href="http://trac.edgewall.org/"&gt;trac&lt;/a&gt;, or just a bare &lt;a href="http://subversion.tigris.org/"&gt;subversion&lt;/a&gt; repository. It doesn't have to be that complicated, though. You can even just put a &lt;a href="http://git-scm.com/"&gt;git&lt;/a&gt; repository on your static homepage and have people pull from there. And of course, register your project with &lt;a href="http://mloss.org"&gt;mloss&lt;/a&gt;, such that others can find it and stay up to date on releases.
&lt;/p&gt;

 &lt;/li&gt;

 &lt;li&gt;&lt;p&gt;Turn your research project into a software project to create
     something others can readily reuse. This means making your software
     usable for others, interfacing it with existing software, and
     reusing existing software yourself. It doesn't have to be
     large if it's useful. Have a look at &lt;a href="http://mloss.org"&gt;mloss&lt;/a&gt; for
     a huge list of already existing machine learning related software
     projects.
&lt;/p&gt;

 &lt;/li&gt;
&lt;/ul&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Mikio Braun</dc:creator><pubDate>Wed, 13 Jan 2010 10:53:43 -0000</pubDate><guid>http://mloss.org/community/blog/2010/jan/13/the-open-source-process-and-research/</guid></item><item><title>MLOSS ICML 2010 workshop?</title><link>http://mloss.org/community/blog/2009/dec/16/mloss-icml-2010-workshop/</link><description>&lt;p&gt;We are thinking of organizing an ICML 2010 workshop on machine learning open source software. Does anyone here think this is a great idea, as we do? If you would like to see this happen, please contact us and help us organize it.
&lt;/p&gt;
&lt;p&gt;Thanks!
&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Soeren Sonnenburg</dc:creator><pubDate>Wed, 16 Dec 2009 12:27:32 -0000</pubDate><guid>http://mloss.org/community/blog/2009/dec/16/mloss-icml-2010-workshop/</guid></item><item><title>US open access policy</title><link>http://mloss.org/community/blog/2009/dec/14/us-open-access-policy/</link><description>&lt;p&gt;The &lt;a href="http://www.ostp.gov/cs/home"&gt;Office of Science and Technology Policy&lt;/a&gt; of the United States of America is having a &lt;a href="http://blog.ostp.gov/2009/12/10/policy-forum-on-public-access-to-federally-funded-research-implementation/"&gt;public consultation&lt;/a&gt; on Public Access Policy, which will run till 7 January 2010. The first part (10-20 December 2009) considers implementation issues, in particular:
&lt;/p&gt;
&lt;ul&gt;
 &lt;li&gt;
     Who should enact public access policies?
 &lt;/li&gt;

 &lt;li&gt;
     How should a public access policy be designed?
 &lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The next two sections are (&lt;a href="http://www.whitehouse.gov/blog/2009/12/09/ostp-launch-public-forum-how-best-make-federally-funded-research-results-available-f"&gt;details here&lt;/a&gt;):
&lt;/p&gt;
&lt;ul&gt;
 &lt;li&gt;
     Features and Technology (Dec. 21 to Dec. 31)
 &lt;/li&gt;

 &lt;li&gt;
     Management (Jan. 1 to Jan. 7)
 &lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you care about how your research is being published, head over and give your views.
&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Cheng Soon Ong</dc:creator><pubDate>Mon, 14 Dec 2009 21:42:44 -0000</pubDate><guid>http://mloss.org/community/blog/2009/dec/14/us-open-access-policy/</guid></item><item><title>Documentation is hard to do</title><link>http://mloss.org/community/blog/2009/dec/04/documentation-is-hard-to-do/</link><description>&lt;p&gt;There was an &lt;a href="http://www.technewsworld.com/story/68798.html"&gt;article at TechNewsWorld&lt;/a&gt; yesterday about the poor state of documentation in Linux. It seems that for most projects, there are two kinds of people: the users and the developers. Users always complain that the documentation is not good enough, and developers don't see the point of writing it. Funnily, once tech-savvy users start digging around in the code a bit, they wake up one day and find that they have crossed the fence, i.e. the project which they initially said was badly documented is now one they are actively contributing to. Even worse, they often don't write documentation themselves either.
&lt;/p&gt;
&lt;p&gt;&lt;a href="http://pragprog.com/titles/tpp/the-pragmatic-programmer"&gt;The Pragmatic Programmer&lt;/a&gt; gives two tips about documentation:
&lt;/p&gt;
&lt;ul&gt;
 &lt;li&gt;
     Treat English as just another programming language
 &lt;/li&gt;

 &lt;li&gt;
     Build documentation in, don't bolt it on
 &lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Then it goes on to distinguish between internal and external documentation. I think that for machine learning, the external part is really important. Very often, the users of machine learning software are not experts in the field, and "just" downloaded the code to see whether they can solve their problem. In fact, very often, the user is not even familiar with the programming language that the project is implemented in. Each language has its own idiosyncrasies, and projects should try to have at least a README file that tells the user how to get things working. This includes basic things such as how to compile and which command line operations are needed to get the paths correct. Even interpreted languages can be tricky. For example, MATLAB often requires the right set of addpath statements to get things working, and Python requires that $PYTHONPATH be set correctly.
&lt;/p&gt;
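&lt;p&gt;For the Python case, a README could point to a small self-check like the one below. This is only a sketch under assumptions: the function names check_importable and report are made up for illustration, and it only handles top-level module names.
&lt;/p&gt;

```python
import importlib.util
import os


def check_importable(module_name):
    """Return True if a top-level module can be found on the import path."""
    return importlib.util.find_spec(module_name) is not None


def report(module_name):
    """Build a short message that tells the user what to fix, if anything."""
    if check_importable(module_name):
        return module_name + " found"
    current = os.environ.get("PYTHONPATH", "(not set)")
    return module_name + " not found; PYTHONPATH is " + current


# Replace "json" with the name of the package the project ships.
print(report("json"))
```

&lt;p&gt;A user who sees the "not found" message at least learns which variable to look at, without having to know how Python resolves imports.
&lt;/p&gt;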
&lt;p&gt;It happens quite often that reviewers of JMLR submissions complain of not being able to "get the code working". Sometimes this is due to a deeper problem, but often it is just because the reviewer is not a user of the programming language of the submission. Now, before you criticize me and ask why I don't choose better reviewers: if you take the intersection of machine learning expertise, programming language and operating system, you often end up with only one group of people, namely the ones that submitted the project.
&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Cheng Soon Ong</dc:creator><pubDate>Fri, 04 Dec 2009 14:19:32 -0000</pubDate><guid>http://mloss.org/community/blog/2009/dec/04/documentation-is-hard-to-do/</guid></item></channel></rss>