GSoC 2013
by Cheng Soon Ong on April 9, 2013 (0 comments)
GSoC has just announced the list of participating organisations. This is a great opportunity for students to get involved in projects that matter, and to learn about code development which is bigger than the standard "one semester" programming project that they are usually exposed to at university.
Some statistics:
- 177 of 417 projects were accepted, which is a success rate of 42%.
- 40 of the 177 projects were accepted for the first time, which is a 23% proportion of new blood.
These seem to be in the same ballpark as most other competitive schemes for obtaining funding. Perhaps there is some type of psychological "mean" which reviewers gravitate to when they are evaluating submissions. For example, consider that out of the 4258 students that applied for projects in 2012, 1212 students got accepted, a rate of 28%.
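For anyone who wants to check the arithmetic, here is a minimal sketch that recomputes the three rates quoted above. The helper name `rate` and the labels are mine, chosen purely for illustration; the counts are the ones given in this post.

```python
def rate(accepted, total):
    """Return the acceptance rate as a percentage."""
    return 100.0 * accepted / total


# Counts quoted in the post: (accepted, total).
figures = {
    "projects accepted in 2013": (177, 417),
    "first-time projects among those accepted": (40, 177),
    "students accepted in 2012": (1212, 4258),
}

for label, (accepted, total) in figures.items():
    print(f"{label}: {accepted}/{total} = {rate(accepted, total):.0f}%")
```

Rounded to whole percentages, this gives 42%, 23% and 28%.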
To the students out there, please get in touch with potential mentors before putting in your applications. You'd be surprised at how much it could improve your application!
Scientist vs Inventor
by Cheng Soon Ong on March 18, 2013 (1 comment)
Mikio and I are writing a book chapter about "Open Science in Machine Learning", which will appear in a collection titled "Implementing Computational Reproducible Research". Among many things, we mentioned that machine learning is about inventing new methods for solving problems. Luis Ibanez from Kitware pounced on this statement, and proceeded to give a wonderful argument that we are confusing our roles as scientists with the pressure of being an inventor. The rest of this post is an exact reproduction of Luis' response to our statement.
“... machine learning is concerned with creating new learning methods to perform well on certain application problems.”
The authors discuss the purpose of machine learning, but under the untold context of “research on machine learning”, and the current landscape of funding research. To clarify, the authors imply that novelty is the purpose of machine learning research. More explicitly, that “developing new methods” is the goal of research.
This is a view (not limited to machine learning) that is commonly widespread, and that in practice is confirmed by the requirements of publishing and pursuit of grant funding. I beg to differ with this view, in the sense that “novelty” is not part of the scientific process at all. Novelty is an artificial condition that has been imposed on scientific workers over the years, due to the need to evaluate performance for the purpose of managing scarce funding resources. The goal of scientific research is to attempt to understand the world by direct observation, crafting of hypothesis and evaluation of hypothesis via reproducible experiments.
The pursuit of novelty (real or apparent) is actually a distraction, and it is one of the major obstacles to the practice of reproducible research. By definition, repeating an experiment, implies, requires and demands to do something that is not new. This distracted overrating of novelty is one of the reasons why scientific workers, and their institutions have come to consider repeatability of experiments as a “waste of time”, since it takes resources away from doing “new things” that could be published or could lead to new streams of funding. This confusion with “novelty” is also behind the lack of interest in reproducing experiments that have been performed by third parties. Since, such actions are “just repeating” what someone else did, and are not adding anything “new”. All, statements that are detrimental to the true practice of the scientific method.
The confusion is evident when one looks at calls for proposals for papers in journals, conferences, or for funding programs. All of them call for “novelty”, none of them (with a handful of exceptions) call for reproducibility. The net effect is that we have confused two very different professions: (a) scientific researcher, with (b) inventor. Scientific researchers should be committed to the application of the scientific method, and in it, there is no requirement for novelty. The main commitment is to craft reproducible experiments, since we are after the truth, not after the new. Inventors on the other hand are in the business of coming up with new devices, and are not committed to understanding the world around us.
Most conferences, journals, and even funding agencies have confused their role of supporting the understanding of the world around us, and have become surrogates for the Patent Office.
In order to make progress in the pursuit of reproducible research, we need to put “novelty” back in its rightful place of being a nice extra secondary or tertiary feature of scientific research, but not a requirement, nor a driving force at all.
Software Licensing
by Cheng Soon Ong on February 6, 2013 (1 comment)
One of the tricky decisions software authors have to make is "What license should I use for my software?" A recent article in PLoS Computational Biology discusses the different possible avenues open to authors. It gives a balanced view of software licensing, carefully describing the various dimensions authors of software should consider before coming to a decision.
It recommends the following guidelines:
- For the widest possible distribution consider a permissive FOSS license such as the BSD/MIT, Apache, or ECL.
- For assuring that derivatives also benefit FOSS, choose a copyleft FOSS license like the GPL, LGPL, or MPL.
- For those on the fence, there is hybrid or multi-licensing, which can achieve the benefits of both open source and proprietary software licenses.
- For protecting the confidentiality of your code, there is the proprietary license.
Naturally, since this is an open source venue, I strongly encourage people to consider the first two options. We also discuss the distinction between FOSS licences in our position paper from 2007.
Chemical compound and drug name recognition task.
by Martin Krallinger on January 2, 2013 (2 comments)
CALL FOR PARTICIPATION: CHEMDNER task: Chemical compound and drug name recognition task.
( http://www.biocreative.org/tasks/biocreative-iv/chemdner )
TASK GOAL AND MOTIVATION
Machine learning methods have been especially useful for the automatic recognition of entity mentions in text, a crucial step for further natural language processing tasks. This task aims to promote the development of open source software for indexing documents with compounds and recognizing compound mentions in text.
The goal of this task is to promote the implementation of systems that are able to detect mentions in text of chemical compounds and drugs. The recognition of chemical entities is also crucial for other subsequent text processing strategies, such as detection of drug-protein interactions, adverse effects of chemical compounds or the extraction of pathway and metabolic reaction relations. A range of different methods have been explored for the recognition of chemical compound mentions including machine learning based approaches, rule-based systems and different types of dictionary-lookup strategies.
As has been the case in previous BioCreative efforts (resulting in high impact papers in the field), we expect that successful participants will have the opportunity to publish their system descriptions in a journal article.
CHEMDNER DESCRIPTION
CHEMDNER is one of the tracks of the BioCreative IV community challenge (http://www.biocreative.org).
We invite participants to submit results for the CHEMDNER task providing predictions for one or both of the following subtasks:
a) Given a set of documents, return for each of them a ranked list of chemical entities described within each of these documents [Chemical document indexing sub-task]
b) Provide for a given document the start and end indices corresponding to all the chemical entities mentioned in this document [Chemical entity mention recognition sub-task].
For these two tasks the organizers will release training and test data collections. The task organizers will also provide details on the annotation guidelines used, define a list of criteria for relevant chemical compound entity types, and describe the selection of documents for annotation.
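To make the two subtasks concrete, here is a minimal sketch of the simplest of the approaches mentioned above, a dictionary lookup. Everything in it is an assumption made for illustration: the tiny lexicon, the function names `find_mentions` and `rank_entities`, the use of character offsets with an exclusive end index, and ranking by mention count. It is not the official CHEMDNER format or evaluation; the task guidelines define those.

```python
import re
from collections import Counter

# Hypothetical toy lexicon; a real system would use a large curated dictionary.
LEXICON = ["aspirin", "acetylsalicylic acid", "ibuprofen", "sodium chloride"]

# Try longer names first so that multi-word names win over shorter overlaps.
_PATTERN = re.compile(
    r"\b(?:"
    + "|".join(re.escape(name) for name in sorted(LEXICON, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)


def find_mentions(text):
    """Subtask (b): return (start, end, surface) per match; end is exclusive."""
    return [(m.start(), m.end(), m.group(0)) for m in _PATTERN.finditer(text)]


def rank_entities(text):
    """Subtask (a): rank distinct entities by mention count (a crude score)."""
    counts = Counter(surface.lower() for _, _, surface in find_mentions(text))
    return [name for name, _ in counts.most_common()]


doc = "Aspirin (acetylsalicylic acid) and ibuprofen were compared; aspirin was cheaper."
for start, end, surface in find_mentions(doc):
    print(start, end, surface)
print(rank_entities(doc))
```

A lookup like this misses unseen names and spelling variants, which is one reason machine learning and rule-based systems are also in the running.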
REGISTRATION
Teams can participate in the CHEMDNER task by registering for track 2 of BioCreative IV. You can also register for other tracks. To register your team, go to the following page, which provides more detailed instructions: http://www.biocreative.org/news/biocreative-iv/team/
MAILING LIST AND CONTACT INFORMATION
You can post questions related to the CHEMDNER task to the BioCreative mailing list. To register for the BioCreative mailing list, please visit the following page: http://biocreative.sourceforge.net/mailing.html
You can also send questions directly to the organizers by e-mail: mkrallinger@cnio.es
WORKSHOP
CHEMDNER is part of the BioCreative evaluation effort. The BioCreative Organizing Committee will host the BioCreative IV Challenge evaluation workshop (http://www.biocreative.org/events/biocreative-iv/CFP/) at NCBI, National Institutes of Health, Bethesda, Maryland, on October 7-9, 2013.
CHEMDNER TASK ORGANIZERS
- Martin Krallinger, Spanish National Cancer Research Center (CNIO)
- Obdulia Rabal, University of Navarra, Spain
- Julen Oyarzabal, University of Navarra, Spain
- Alfonso Valencia, Spanish National Cancer Research Center (CNIO)
Paper "Ten Simple Rules for the Open Development of Scientific Software" by Prlic and Procter
by Mikio Braun on December 11, 2012 (0 comments)
PLOS Computational Biology has an interesting Editorial on 10 rules for open development of scientific software. The ten rules are:
- Don't Reinvent the Wheel
- Code Well
- Be Your Own User
- Be Transparent
- Be Simple
- Don't Be a Perfectionist
- Nurture and Grow Your Community
- Promote Your Project
- Find Sponsors
- Science Counts.
The full article can be found here.
Best Practices for Scientific Computing
by Cheng Soon Ong on November 28, 2012 (0 comments)
I've been following the progress of Software Carpentry for some years now, and have been very impressed by their message that software is the new telescope, and we should invest time and effort to build up skills to ensure that our software is the best quality possible. Otherwise, how can we be sure that our new discoveries are not due to some instrument error?
They wrote a nice short paper titled "Best Practices for Scientific Computing" that highlights practices that would improve the quality of the software, and hence improve research productivity. Here are the 10 recommendations (along with the sub-recommendations).
1. Write programs for people, not computers.
1.1 a program should not require its readers to hold more than a handful of facts in memory at once
1.2 names should be consistent, distinctive, and meaningful
1.3 code style and formatting should be consistent
1.4 all aspects of software development should be broken down into tasks roughly an hour long
2. Automate repetitive tasks.
2.1 rely on the computer to repeat tasks
2.2 save recent commands in a file for re-use
2.3 use a build tool to automate scientific workflows
3. Use the computer to record history.
3.1 software tools should be used to track computational work automatically
4. Make incremental changes.
4.1 work in small steps with frequent feedback and course correction
5. Use version control.
5.1 use a version control system
5.2 everything that has been created manually should be put in version control
6. Don’t repeat yourself (or others).
6.1 every piece of data must have a single authoritative representation in the system
6.2 code should be modularized rather than copied and pasted
6.3 re-use code instead of rewriting it
7. Plan for mistakes.
7.1 add assertions to programs to check their operation
7.2 use an off-the-shelf unit testing library
7.3 turn bugs into test cases
7.4 use a symbolic debugger
8. Optimize software only after it works correctly.
8.1 use a profiler to identify bottlenecks
8.2 write code in the highest-level language possible
9. Document the design and purpose of code rather than its mechanics.
9.1 document interfaces and reasons, not implementations
9.2 refactor code instead of explaining how it works
9.3 embed the documentation for a piece of software in that software
10. Conduct code reviews.
10.1 use code review and pair programming when bringing someone new up to speed and when tackling particularly tricky design, coding, and debugging problems
10.2 use an issue tracking tool
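Recommendation 7 is perhaps the easiest one to start with today. Below is a minimal sketch of 7.1 to 7.3 using Python's standard `unittest` module. The function `mean` and the bug described in the comments are hypothetical, invented for illustration; they do not come from the paper.

```python
import unittest


def mean(values):
    """Arithmetic mean of a non-empty sequence of numbers."""
    # 7.1: an assertion states and checks an assumption about the input.
    # (Assertions are skipped when Python runs with -O.)
    assert len(values) > 0, "mean() needs at least one value"
    return sum(values) / len(values)


# 7.2: an off-the-shelf unit testing library.
class TestMean(unittest.TestCase):
    def test_simple(self):
        self.assertAlmostEqual(mean([1, 2, 3]), 2.0)

    def test_empty_input_regression(self):
        # 7.3: suppose an earlier version crashed with ZeroDivisionError
        # on an empty list; the bug report becomes a permanent test.
        with self.assertRaises(AssertionError):
            mean([])


# Run the tests programmatically so the example also works inside a notebook.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestMean)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Once the tests live in a file under version control (recommendation 5), every later change can be checked against them automatically (recommendation 2).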
Predict Elections with Twitter
by Cheng Soon Ong on October 12, 2012 (0 comments)
In a paper with the rather self-deprecating title "I wanted to Predict Elections with Twitter and all I got was this Lousy Paper", Daniel Gayo-Avello takes us on a tour of how hard it is to do reproducible research, and how often authors take shortcuts. From the abstract:
"Predicting X from Twitter is a popular fad within the Twitter research subculture. It seems both appealing and relatively easy. Among such kind of studies, electoral prediction is maybe the most attractive, and at this moment there is a growing body of literature on such a topic. This is not only an interesting research problem but, above all, it is extremely difficult. However, most of the authors seem to be more interested in claiming positive results than in providing sound and reproducible methods."
It is an interesting survey of papers that use Twitter data.
http://arxiv.org/pdf/1204.6441v1.pdf
He lists some flaws in current research on electoral predictions, but they are generally applicable to any machine learning paper (my comments in brackets):
- It's not prediction at all! I have not found a single paper predicting a future result. (Neither bootstrap nor cross-validation is prediction.)
- Chance is not a valid baseline...
- There is not a commonly accepted way of "counting votes" in Twitter
- There is not a commonly accepted way of interpreting reality! (In supervised learning, we tend to ignore the fact that there is no ground truth in reality.)
- Sentiment analysis is applied as a black-box... (As machine learning algorithms get more complex, more people will tend to use machine learning software as a black box)
- All the tweets are assumed to be trustworthy. (I don't know if anybody is doing adversarial election prediction)
- Demographics are neglected. (The biased sample problem)
- Self-selection bias.
The window is closing on those who want to predict the upcoming US elections from X.
John Hunter - the author of matplotlib - has died.
by Soeren Sonnenburg on August 30, 2012 (1 comment)
John Hunter, the main author of matplotlib, has died of cancer. For those interested, his close friend Fernando gives a few more details here. John was a long-term developer of matplotlib (even continuing while he was working in industry) and a father of three kids. You might consider donating to the John Hunter Memorial Fund.
We had John as an invited speaker at one of our NIPS machine learning open source software workshops. He gave quite an entertaining talk featuring a live demo. I recall that he started with a command prompt, typing everything (including fetching some live stock-exchange data) in Python at some insane speed. Videolectures recorded his lecture. I don't know about others, but I have plotted basically all my scientific results using matplotlib and Python for the last several years.
Rest in peace John - your contributions will be remembered.
Machine Learning already matters
by Cheng Soon Ong on June 20, 2012 (0 comments)
"Much of machine learning (ML) research has lost its connection to problems of import to the larger world of science and society." So begins Kiri Wagstaff's position paper that will have a special plenary session on June 29 at ICML 2012. The paper then goes on to lament about the poor state of affairs in machine learning research. The paper is an interesting read, and it addresses an important question that any adolescent field faces: "How do I justify my existence?"
I'd like to take the glass-half-full view. Machine Learning already matters!
Kiri herself uses examples that show that machine learning already has impact. In her introduction, she mentions the CALO project, which forms the basis of Siri on the iPhone 4S, which has revolutionised the way the general public perceives human computer interactions. She also mentions spam detection, which Gmail has generalized to sorting all email with Priority Inbox.
A quick look around the web reveals other success stories:
The Technology Quarterly section of the 2 June 2012 edition of The Economist discusses the use of robots and how we would need to start legislating for them. Ironically, in our human desire to apportion blame in case of failure, we may have to block learning. Quoting the article: "This has implications for system design: it may, for instance, rule out the use of artificial neural networks, decision-making systems that learn from example rather than obeying predefined rules."
Searching for the phrase "machine learning" in PLoS Computational Biology returns 250 hits, showing how machine learning has revolutionised biological research in the high throughput age.
In high energy physics, particle accelerators use anomaly detection algorithms to save only the data which may be interesting. This is the ultimate application of learning with data streams.
At NIPS 2008, in the last talk of the Machine Learning in Computational Biology mini-symposium, I had the pleasure of being inspired by Thomas Lengauer's work on proposing anti-HIV therapy. I'd say that this "solves" challenge number 5 in Kiri's list. Remarkably (unfortunately?), their recommendation site remains just that, a recommendation site, and has yet to navigate the legislative nightmare of getting a website to prescribe drugs. In an answer to a question, he said that Germany was one of the few places in the world where the legislation even allows doctors to use such drug recommendation sites. A scan of the titles cited by the review article reveals keywords which would fit comfortably in a machine learning venue:
- multiple linear regression
- simple linear model
- prediction-based classification
- artificial neural networks
- self organising feature maps
- non-parametric methods
- sparse models
- convex optimization
But doom and gloom persists. Why? My personal opinion is that like most successful technologies, machine learning fades into the background once it has impact. In that vein of thought, we can measure the impact of machine learning by the decline of ICML, JMLR and friends. Meanwhile, I'm going to go back to making machine learning disappear...
Please join in the discussion at http://mlimpact.com/.
Google Summer of Code 2012
by Cheng Soon Ong on April 24, 2012 (0 comments)
The list of Google Summer of Code (GSoC) students for 2012 has been announced.
For young programmers, it is probably the easiest way to get your foot in the door by showing that you can contribute to something worthwhile. For open source projects, it is an injection of fresh blood. For academics looking for programmer types, it is a good way to differentiate between all the applicants with top marks from universities which you do not know personally.
Among the mentoring organisations which may be of interest to the machine learning community:
- Battle for Wesnoth with 5 students
- cgal with 4 students
- CMU Sphinx with 6 students
- DBpedia spotlight with 4 students
- Genome Informatics with 2 students
- Gephi consortium with 5 students
- Hedgewars project with 5 students
- National Resource for Network Biology with 16 students
- Open Bioinformatics Foundation with 5 students
- OpenCV with 12 students
- OpenCog foundation with 5 students
- Orange with 5 students
- shogun with 8 students, and I'm mentoring here.
- SimpleCV with 4 students
A warm welcome to everyone!