Open Thoughts

March 2008 archive

Does the machine learning community need a data interchange format?

March 16, 2008

While a standardized data format seems like a good thing in principle, in practice everyone seems to write their own little data parser. Machine learning researchers apparently find it too troublesome to agree on a standard format. For simple tabular data, delimited ASCII-based formats may strike the right tradeoff between human readability and efficiency.

For example, the UCI Machine Learning Repository uses simple comma-separated values, with one example per row. Additional information, such as which column contains the label, is given in a separate file.
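Reading such a file takes only a few lines; here is a minimal sketch in Python (the function name is mine, and the label is assumed to sit in the last column, which varies across datasets):

```python
import csv

def load_csv(lines):
    """Load UCI-style comma-separated data, one example per row.

    Assumes the label is in the last column; real UCI datasets
    record the label position in a separate file.
    """
    features, labels = [], []
    for row in csv.reader(lines):
        if not row:  # skip blank lines
            continue
        features.append([float(v) for v in row[:-1]])
        labels.append(row[-1])
    return features, labels

# e.g. X, y = load_csv(open("iris.data"))  # hypothetical filename
```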


An alternative is a sparse ASCII format, where each entry is an (index, value) pair. LIBSVM uses such a format:

1 1:0.68421 2:-0.616601 3:0.144385 4:-0.484536 5:0.23913
1 1:0.142105 2:-0.588933 3:-0.165775 4:-0.938144 5:-0.347826
1 1:0.121053 2:-0.359684 3:0.40107 4:-0.175258 5:-0.326087
1 1:0.757895 2:-0.521739 3:0.219251 4:-0.360825 5:-0.0652174
1 1:0.163158 2:-0.268775 3:0.614973 4:0.0721649 5:0.0434783
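Parsing one of these lines is straightforward; a minimal sketch in Python (the function name is mine):

```python
def parse_libsvm_line(line):
    """Parse one LIBSVM-style line: '<label> <index>:<value> ...'.

    Returns the label and a sparse {index: value} dictionary;
    indices that do not appear are implicitly zero.
    """
    fields = line.split()
    label = float(fields[0])
    features = {}
    for item in fields[1:]:
        index, value = item.split(":")
        features[int(index)] = float(value)
    return label, features

# e.g. parse_libsvm_line("1 1:0.68421 2:-0.616601")
#      -> (1.0, {1: 0.68421, 2: -0.616601})
```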

Another possibility is to augment the delimited file with a 'header' section where metadata is defined. Weka uses the so-called ARFF format, which has a header section before the tabular data begins. Interestingly, there does not seem to be a formal definition of this format; instead, Weka defines it via a set of examples. Recently, an ANTLR grammar, along with a Python implementation of the corresponding lexer/parser, has appeared.
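For illustration, a minimal ARFF file might look like the following (a sketch only; the relation and attribute names are made up):

@relation weather
@attribute temperature numeric
@attribute humidity numeric
@attribute play {yes,no}
@data
85,85,no
70,96,yes

The header declares each column's name and type (numeric, or a nominal set of allowed values), so the label column and its domain are self-describing rather than documented in a separate file.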

Lastly, for those who find ASCII too inefficient, there is HDF5, which claims to be highly scalable.
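As a rough sketch of what that looks like in practice, here is how a feature matrix might be written and read back using the h5py Python bindings (the file and dataset names are illustrative):

```python
import numpy as np
import h5py

X = np.array([[0.68421, -0.616601],
              [0.142105, -0.588933]])

# Write the feature matrix in HDF5's binary format.
with h5py.File("data.h5", "w") as f:
    f.create_dataset("features", data=X)

# Read it back; slicing can load just the requested rows,
# which is what makes the format attractive for large data.
with h5py.File("data.h5", "r") as f:
    X_back = f["features"][:]
```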

However, the original question remains: do we need to agree on one format, and if so, what should it be? Join our discussion!