Open Thoughts

September 2008 archive

Deadline extension mloss 08

September 30, 2008

Murphy's law has struck us. After happily running for more than a year, the hardware that is running mloss.org is facing some strange difficulties the day before our deadline for mloss 08. So, if you cannot submit, don't panic.

So, to be fair, we've decided to extend the deadline to next Monday.

http://mloss.org/workshop/nips08/

Final Call for Contributions: NIPS*08 MLOSS Workshop

September 25, 2008

This is the final call for contributions for the NIPS*08 MLOSS workshop to be held on Friday, December 12th, 2008 in Whistler, British Columbia, Canada.

The submission deadline is approaching quickly: just one week remains until October 1, 2008. We accept all kinds of machine learning (related) software submissions for the workshop. If accepted, you will be given a chance to present your software at the workshop, which is a great opportunity to make your software better known to the NIPS audience and to receive valuable feedback.


We have decided to use mloss.org for managing the submissions. You just have to register your project with mloss.org and tag it nips2008. For more information, have a look at the workshop page.

Data sources

September 19, 2008

Those of us who are interested in algorithm development are often faced with the "have a hammer, looking for a nail" problem. Once we have confirmed that the standard machine learning datasets (for example at UCI) do not offer a useful application area, where does one go? Below, I look at four websites which list data and also software associated with data. The information is not collected with machine learning in mind, so a user would probably need to write preprocessing scripts to convert the data into something useful.
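As a rough illustration of what such a preprocessing script might look like, here is a minimal Python sketch that turns CSV text into numeric feature vectors. The function name csv_to_features, the column names, and the choice of one-hot encoding for non-numeric columns are all assumptions made for this example; none of the sites below prescribe this approach.

```python
import csv
import io


def csv_to_features(text, label_column):
    """Convert CSV text into (features, labels).

    Hypothetical helper: numeric columns are parsed as floats,
    every other column is one-hot encoded over its observed values.
    """
    rows = list(csv.DictReader(io.StringIO(text)))
    if not rows:
        return [], []
    feature_columns = [c for c in rows[0] if c != label_column]

    def is_numeric(column):
        # A column counts as numeric only if every value parses as a float.
        try:
            for row in rows:
                float(row[column])
            return True
        except ValueError:
            return False

    numeric = {c: is_numeric(c) for c in feature_columns}
    # Sorted category lists give a stable one-hot ordering.
    categories = {
        c: sorted({row[c] for row in rows})
        for c in feature_columns
        if not numeric[c]
    }

    features, labels = [], []
    for row in rows:
        vector = []
        for c in feature_columns:
            if numeric[c]:
                vector.append(float(row[c]))
            else:
                vector.extend(1.0 if row[c] == v else 0.0 for v in categories[c])
        features.append(vector)
        labels.append(row[label_column])
    return features, labels


# Made-up sample data, purely for illustration.
sample = "state,population,label\nCA,10,a\nNY,20,b\n"
features, labels = csv_to_features(sample, "label")
```

Real listings will usually need more than this (missing values, dates, nested records), but the shape of the task is the same: read a human-oriented table and emit something a learning algorithm can consume.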

A common theme is that just providing blobs of data isn't enough: one has to provide interfaces or processing tools along with the data. The other common theme is that these sites are just listings of data, not archival copies.

theinfo

This is a site for large data sets and the people who love them: 
the scrapers and crawlers who collect them, 
the academics and geeks who process them, 
the designers and artists who visualize them. 
It's a place where they can exchange tips and tricks, 
develop and share tools together, and begin to integrate their particular projects.

theinfo.org classifies what people want to do with data into three activities: get, process, view. In the get section, they provide a list of links to sources of data, ranging from US congressional district boundaries to stock ticker data (which requires a free registration). Unfortunately, the list of datasets is static and does not provide useful slicing capabilities. In the view section, there is a nice list of different visualizations of datasets, for example a visualization of trends in Twitter, or worldmapper, which morphs the area of a country to correspond to the size of a certain variable of interest, such as the number of internet users.

However, the really nice thing about this site is that for each section it lists tools of the trade and tips and tricks, which are bits of software related to collecting, processing and visualizing data. These are the kinds of things which simplify our data analysis tasks. There doesn't seem to be a tool for each of the data sources listed yet, which means that a machine learner may still need to write their own scraping tool to get data.

infochimps

There are many sources to find out something about everything. 
Until now, there’s been no good place for you to find out everything about something.

This site is still in beta, and currently only provides a list of datasets. They promise to allow uploading of your datasets in the full version. What's nice about the design is that you can slice the list of datasets according to a list of predefined fields or tags. So, in a sense, the design is very much like mloss.org, depending on community involvement to keep the repository fresh and up to date. Most of the data seems to be in tabular format (csv, xls), but they support yaml, which means that in principle more complex structures can exist.
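To make the idea of slicing a listing by tags concrete, here is a small Python sketch. The record layout (a dict with "name", "format" and "tags" keys), the helper name slice_by_tags, and the sample entries are all invented for illustration; this is not how infochimps stores or serves its listings.

```python
def slice_by_tags(datasets, required_tags):
    """Return the names of datasets carrying every tag in required_tags."""
    required = set(required_tags)
    # Set inclusion: a dataset matches if it has all the required tags.
    return [d["name"] for d in datasets if required <= set(d["tags"])]


# Hypothetical catalogue entries, not real listings.
catalogue = [
    {"name": "district-boundaries", "format": "csv", "tags": ["geo", "politics"]},
    {"name": "stock-ticker", "format": "xls", "tags": ["finance", "timeseries"]},
    {"name": "donations", "format": "yaml", "tags": ["politics", "finance"]},
]

political = slice_by_tags(catalogue, ["politics"])
political_finance = slice_by_tags(catalogue, ["politics", "finance"])
```

A static list, by contrast, forces the reader to do this filtering by eye.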

They provide the infinite monkeywrench, which is a scripting language to process data.

(the site seems to be having some problems recently, possibly due to the imminent v1.0)

datamob

Datamob highlights the connection between public data sources 
and the interfaces people are building for them

They list hot new datasets and hot new interfaces, which are the latest listings. They have a short list of machine learning data which includes the venerable UCI and also Netflix. There is a simple submit form which allows one to add a link to the source of data or interface. They don't aim to be comprehensive, but rather to be the best place to see how public data is being put to use online. However, it is a pity that the two lists seem to be independent. It would be nice to see which datasets use which interfaces.

One of the visualizations (under interfaces), of the 2008 presidential donations, pointed out something interesting: often when visualizing data, there are not enough pixels on a screen to represent what you want.

ckan

Those familiar with freshmeat, CPAN or PyPI 
can think of CKAN as providing an analogous service for open knowledge.

They package data in a predefined format, which allows them to design an API. In particular, they encourage open data, that is, material that people are free to use, reuse and redistribute without restriction. The predefined package allows them to attach much more metadata to each submission, and in the long run would allow more automated processing. For example, they allow the download of the metadata of CiteSeer, which is Dublin Core compliant with additional metadata fields, including citation relationships (References and IsReferencedBy), author affiliations, and author addresses.

The REST API essentially defines how client software can upload and download data, and allows querying of what resources are available.
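The three operations mentioned here (upload, download, query) can be sketched with a tiny in-memory registry in Python. This is only a model of the idea: the Package fields, the PackageRegistry class and its method names are hypothetical and do not reproduce CKAN's actual API or package format, and the URL is a placeholder.

```python
from dataclasses import dataclass, field


@dataclass
class Package:
    """Hypothetical package record: a name plus attached metadata."""
    name: str
    title: str
    download_url: str
    tags: list = field(default_factory=list)
    extras: dict = field(default_factory=dict)  # free-form extra metadata


class PackageRegistry:
    """In-memory stand-in for a service exposing register/get/search."""

    def __init__(self):
        self._packages = {}

    def register(self, package):
        # Corresponds to a client uploading (or updating) a package entry.
        self._packages[package.name] = package

    def get(self, name):
        # Corresponds to a client downloading one entry; KeyError if absent.
        return self._packages[name]

    def search(self, tag):
        # Corresponds to querying which resources are available.
        return sorted(n for n, p in self._packages.items() if tag in p.tags)


registry = PackageRegistry()
registry.register(
    Package(
        name="citation-metadata",
        title="Citation metadata dump",
        download_url="http://example.org/citations.xml",  # placeholder URL
        tags=["bibliographic", "metadata"],
        extras={"references": [], "isReferencedBy": []},
    )
)
matches = registry.search("metadata")
```

Because every entry has the same predefined shape, client software can process entries without per-dataset special cases, which is the point of packaging data this way.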

NIPS Workshop 2008 accepted

September 8, 2008

We are glad to announce that our workshop at this year's NIPS conference has been accepted! We are tentatively scheduled for Friday, December 12th, 2008. The workshop will be held at Whistler, British Columbia, Canada.

We accept software submissions for the workshop. The deadline for submissions is October 1, 2008. If accepted, you can present your software to the workshop audience, which is a great opportunity to make your software better known to the NIPS audience.

We have decided to use mloss.org for managing the submissions. You just have to register your project with mloss.org and tag it nips2008. For more information, have a look at the workshop page.

New JMLR-MLOSS publication and progress updates for September 2008

September 3, 2008

Again, almost two months have passed since the last progress report. As Cheng already posted, we finally took the time to make a slightly polished version of the mloss.org source code available.

The usual statistics follow: mloss.org now has 235 registered users and 129 software projects.

Finally, the mloss project liblinear - a library to train linear SVMs in very little time - got accepted in JMLR, and we again highlight the software by interlinking it with the JMLR publication.