March 20, 2009

High-throughput methods for measuring biological systems, together with well-developed databases for making the data publicly available, have left bioinformatics with the problem of too many disparate sources of information. Biological research also rarely lets one ignore the big picture, so a researcher is likely to need access to several of these data sources at once. Recently, two large projects were announced more or less simultaneously, each claiming to provide unifying tools for investigating bioinformatics data.

The first, unison, aims to be a comprehensive warehouse for all things related to protein sequences. It already seems quite developed, with links to many large sources of protein data such as GO, NCBI, SCOP and PDB. One thing I found quite nice is that they provide a tool called "BabelFish", which translates between the different naming conventions for proteins. This means that one can match the proteins referred to in different databases and make use of the combined information much more easily. The other interesting thing is that they also consider predictions to be part of "data". While predictions are treated as second-class citizens in the world of bioinformatics, they are often necessary for poorly studied problems, or for problems where measurements are expensive or take a long time. From a machine learning viewpoint, this is definitely a good thing to see. The site gives a warning when predictions are returned:

Warning: These features are from computational predictions, not experimental data. 
Although we filter features based on score or probability to improve specificity, 
the accuracy of these predictions is largely unknown and varies by method and sequence.

What is even nicer from a machine learning point of view is that all the predictions are displayed on the same plot, so one can directly compare the predictions from various tools for the protein of interest. Furthermore, when experimental verification becomes available in the future, the tools can be compared objectively against it.
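
To make that last point a little more concrete, here is a minimal Python sketch of how predictions from several tools might be scored once experimental annotations exist. Everything in it is an assumption for illustration: the tool names, the residue positions, and the helper functions are hypothetical and have nothing to do with how unison itself stores or compares predictions. The sketch treats each prediction as a set of residue positions and reports precision and recall against the verified set.

```python
def precision_recall(predicted, verified):
    """Return (precision, recall) for two collections of residue positions."""
    predicted, verified = set(predicted), set(verified)
    true_positives = len(predicted & verified)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(verified) if verified else 0.0
    return precision, recall


def score_tools(predictions_by_tool, verified):
    """Score every tool's predictions against the same verified set."""
    return {tool: precision_recall(pred, verified)
            for tool, pred in predictions_by_tool.items()}


# Hypothetical data: residue positions that two made-up tools predict to lie
# in some feature of interest, and the experimentally verified positions.
predictions_by_tool = {
    "tool_a": range(10, 30),   # positions 10..29
    "tool_b": range(15, 40),   # positions 15..39
}
verified = range(12, 32)       # positions 12..31

scores = score_tools(predictions_by_tool, verified)
for tool, (precision, recall) in sorted(scores.items()):
    print(f"{tool}: precision={precision:.2f} recall={recall:.2f}")
```

Because every tool is scored against the same verified positions, the numbers are directly comparable, which is the same benefit the shared plot gives visually.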

The second tool that was announced is sage, which currently has a very bare website. It is worth a mention here nonetheless, because it is based on internal work from Merck/Rosetta and hence may provide an integrated environment for studying disease. In an interview with Bioinform, Eric Schadt claims that even structural data will be made available. The target launch date is 1 July 2009.