Open Thoughts

Does the machine learning community need a data interchange format?

Posted by Cheng Soon Ong on March 16, 2008

While a standardized data format seems like a good thing in principle, in practice everyone seems to write their own little data parser. Machine learning researchers apparently find it too troublesome to agree on a standard format. For simple tabular data, delimited ASCII-based formats may offer the right tradeoff between human readability and efficiency.

For example, the UCI Machine Learning Repository uses simple comma-separated values, with one example per row. Additional information, such as which column contains the label, is given in a separate file.

1,14.23,1.71,2.43,15.6,127,2.8,3.06,.28,2.29,5.64,1.04,3.92,1065
1,13.2,1.78,2.14,11.2,100,2.65,2.76,.26,1.28,4.38,1.05,3.4,1050
1,13.16,2.36,2.67,18.6,101,2.8,3.24,.3,2.81,5.68,1.03,3.17,1185
1,14.37,1.95,2.5,16.8,113,3.85,3.49,.24,2.18,7.8,.86,3.45,1480
1,13.24,2.59,2.87,21,118,2.8,2.69,.39,1.82,4.32,1.04,2.93,735
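As an illustration, here is a minimal Python sketch of loading such a file with the standard library; the file name 'wine.data' and the convention that the first column holds the label are assumptions matching the snippet above, not part of the repository's specification.

# Minimal sketch: read a dense comma-separated file, label in column one.
import csv

labels, features = [], []
with open('wine.data') as f:               # hypothetical file name
    for row in csv.reader(f):
        if not row:                        # skip blank lines
            continue
        labels.append(int(row[0]))
        features.append([float(x) for x in row[1:]])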

An alternative is a sparse ASCII format, where each entry is an (index,value) pair. LIBSVM uses such a format.

1 1:0.68421 2:-0.616601 3:0.144385 4:-0.484536 5:0.23913
1 1:0.142105 2:-0.588933 3:-0.165775 4:-0.938144 5:-0.347826
1 1:0.121053 2:-0.359684 3:0.40107 4:-0.175258 5:-0.326087
1 1:0.757895 2:-0.521739 3:0.219251 4:-0.360825 5:-0.0652174
1 1:0.163158 2:-0.268775 3:0.614973 4:0.0721649 5:0.0434783
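A rough Python sketch of parsing this sparse format into a label and an index-to-value dictionary could look as follows; the file name 'data.txt' is just a placeholder.

# Sketch: parse one LIBSVM-style line into (label, {index: value}).
def parse_sparse_line(line):
    tokens = line.split()
    label = float(tokens[0])
    entries = {}
    for tok in tokens[1:]:
        index, value = tok.split(':')
        entries[int(index)] = float(value)
    return label, entries

with open('data.txt') as f:                # placeholder file name
    examples = [parse_sparse_line(line) for line in f if line.strip()]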

Another possibility is to prepend a 'header' section to the delimited file, where metadata is defined. Weka uses the so-called ARFF format, which places such a header section before the tabular data begins. Interestingly, there does not seem to be a formal definition of this data format; instead, Weka defines the format via a set of examples. Recently, an ANTLR definition, along with a Python implementation of the corresponding lexer/parser, has appeared.
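Since there is no formal specification, the following Python sketch only handles the pieces an ARFF file is usually described as containing: '@attribute' lines in the header, followed by '@data' and comma-separated rows. The file name 'wine.arff' is a placeholder, and real files (quoted attribute names, other attribute types) would need more care.

# Sketch: collect header metadata, then read the tabular section.
attributes = []   # (name, type) pairs from the header
data = []         # rows from the @data section
in_data = False
with open('wine.arff') as f:               # placeholder file name
    for line in f:
        line = line.strip()
        if not line or line.startswith('%'):    # blank lines and comments
            continue
        if in_data:
            data.append(line.split(','))
        elif line.lower().startswith('@attribute'):
            # e.g. "@ATTRIBUTE alcohol NUMERIC" -> ('alcohol', 'NUMERIC')
            _, name, atype = line.split(None, 2)
            attributes.append((name, atype))
        elif line.lower().startswith('@data'):
            in_data = True
        # @relation and anything else in the header is ignored here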

Lastly, for those who find ASCII too inefficient, there is HDF5, which claims to be highly scalable.
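For the curious, here is a small sketch of writing and reading arrays with HDF5, assuming the h5py Python bindings are available (other bindings such as PyTables exist); the file name and toy data are made up.

# Sketch: store and load a feature matrix and labels in an HDF5 file.
import numpy
import h5py

X = numpy.random.rand(100, 13)          # toy feature matrix
y = numpy.random.randint(1, 4, 100)     # toy labels

with h5py.File('wine.h5', 'w') as f:    # placeholder file name
    f.create_dataset('features', data=X)
    f.create_dataset('labels', data=y)

with h5py.File('wine.h5', 'r') as f:
    features = f['features'][:]
    labels = f['labels'][:]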

However, the original question remains: do we need to agree on one format, and if so, what should it be? Join our discussion!

Comments

Mike Gashler (on June 4, 2008, 00:06:29)

How about if someone just writes a simple open source command-line tool that will convert among these formats?

Advantages of this approach include:

1. Everyone can use the format best suited to his/her application.
2. It's easier to write this tool than to get everyone to agree on a standard.
3. This tool would become a central location for standardizing formats: if you invent a new format, you should add support to this tool.

The obvious drawback is that this discourages use of obscure features in rich formats (because such features would necessarily be stripped when the file is converted to a simpler format). But I propose that this would actually be a good thing because it would prevent the community from drifting toward a bloated standard that would impede progress by being bothersome to implement.
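For instance, a minimal sketch of one such conversion, from the dense comma-separated format above to the sparse LIBSVM-style format, might look like the following; the file names are placeholders and zero entries are simply dropped.

# Sketch: convert dense CSV (label in column one) to sparse LIBSVM format.
with open('wine.data') as src, open('wine.libsvm', 'w') as dst:
    for line in src:
        fields = line.strip().split(',')
        if fields == ['']:                  # skip blank lines
            continue
        label, values = fields[0], fields[1:]
        pairs = ['%d:%s' % (i + 1, v) for i, v in enumerate(values)
                 if float(v) != 0.0]        # keep only nonzero entries
        dst.write('%s %s\n' % (label, ' '.join(pairs)))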

Soeren Sonnenburg (on June 4, 2008, 21:23:23)

I really think we need a standard. Datasets always come in the wrong format, and it would be great to have a common format, at least for the simpler data types. That way programmers could assume a fixed input format, and data suppliers would know which format to choose. But yes, a converter to and from this format is a really good idea, especially as this is yet another counterexample to "one size fits all".
