Open Thoughts

Fear of salads

June 30, 2011

At ICML yesterday, I saw two interesting papers about crowdsourcing. "Adaptively Learning the Crowd Kernel" looks at learning similarities between n objects by choosing triplets (a, b, c) and asking human experts to say whether a is nearer to b or to c. The second paper, "Active Learning from Crowds", proposes a probabilistic model for actively choosing both examples and expert annotators. Unfortunately, neither paper seems to have its software available online.
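To make the triplet idea concrete, here is a small toy sketch in Python. It is not the algorithm from the paper: the "expert" is simulated by hidden positions on a line, and the function names (triplet_answer, sample_triplets, triplet_agreement) are made up for illustration. It only shows what a triplet query looks like and how one could check whether a candidate embedding agrees with the collected answers.

```python
import random


def triplet_answer(positions, a, b, c):
    """Simulated expert: True if object a is nearer to b than to c.

    Assumption: objects live on a line and `positions` holds their
    (hidden) coordinates. A real annotator would just give an opinion.
    """
    return abs(positions[a] - positions[b]) < abs(positions[a] - positions[c])


def sample_triplets(n, k, rng):
    """Draw k random triplets (a, b, c) of distinct object indices."""
    return [tuple(rng.sample(range(n), 3)) for _ in range(k)]


def triplet_agreement(candidate, answers):
    """Fraction of recorded answers that a candidate embedding reproduces."""
    hits = sum(
        triplet_answer(candidate, a, b, c) == nearer_b
        for (a, b, c), nearer_b in answers
    )
    return hits / len(answers)


# Hypothetical hidden positions of four objects (all pairwise distances differ).
hidden = [0.0, 1.0, 3.0, 7.0]
rng = random.Random(0)
queries = sample_triplets(len(hidden), 20, rng)
answers = [(t, triplet_answer(hidden, *t)) for t in queries]

print(triplet_agreement(hidden, answers))
print(triplet_agreement([2 * x for x in hidden], answers))
```

Both the hidden positions and a rescaled copy of them reproduce every answer, which illustrates that triplet comparisons only pin down relative similarities, not absolute distances.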

Some weeks ago, there was an outbreak of a particularly menacing strain (O104) of E. coli in Europe. Now, E. coli is one of the most widely studied organisms in biological labs worldwide, and its genome was one of the first to be published, way back in the last century (1997). Its relatively small genome (4.6 million base pairs in length, containing 4288 annotated protein-coding genes) means that it can be sequenced quite quickly. In fact, nine isolates have now been sequenced by five different teams on four different sequencing platforms: the Ion Torrent, Illumina HiSeq, Roche's 454 GS Junior, and most recently the Illumina MiSeq. From the sequencing perspective, this is really the first time the different next-generation sequencing platforms can be compared side by side. Bioinformatics pipelines will surely improve once researchers better understand the read errors of each platform by comparing them.

All this data has been collected on GitHub, giving machine learners an excellent crowdsourced dataset. This rich dataset could be used to study evolution and to understand the mutations that caused virulence. It gives the machine learning community a great opportunity to break out of the binary classification mold and study some interesting new machine learning tasks.