Project details for dataformat

dataformat 0.1.1

by mikio - March 12, 2009, 16:07:55 CET


Description:

The goal of this project is to provide code for reading and writing machine learning data sets for as many programming languages as possible.

This way it should become much easier to have code written in different languages speak to each other.

Currently, we are focusing on the ARFF file format, developed in the Weka project.

Currently covered languages are:

  • Python
  • Ruby
  • MATLAB
  • Java

C and C++ are next on the list.

ARFF is covered except for:

  • sparse features
  • date attributes
  • relational attributes
  • missing values
  • instance weights
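To make the covered subset concrete, here is a minimal Python sketch of a reader for dense ARFF files with numeric, nominal, and string attributes. It is not the project's actual code: the names `read_arff`, `parse_type`, and `convert` are made up for this illustration, and the sketch assumes unquoted attribute names and no missing values.

```python
NUMERIC_TYPES = {"numeric", "real", "integer"}


def parse_type(spec):
    """Return a list of labels for a nominal type, else the lowercased type name."""
    spec = spec.strip()
    if spec.startswith("{") and spec.endswith("}"):
        return [label.strip() for label in spec[1:-1].split(",")]
    return spec.lower()


def convert(value, attr_type):
    """Convert numeric values to float; leave nominal and string values as text."""
    if isinstance(attr_type, str) and attr_type in NUMERIC_TYPES:
        return float(value)
    return value


def read_arff(text):
    """Parse dense ARFF text into (relation, attributes, rows).

    Illustration only: no sparse data, dates, relational attributes,
    missing values, or instance weights.
    """
    relation = None
    attributes = []  # list of (name, type) pairs
    rows = []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        if in_data:
            # Naive split on commas: assumes no value contains a comma.
            values = [v.strip() for v in line.split(",")]
            rows.append([convert(v, t) for v, (_, t) in zip(values, attributes)])
            continue
        parts = line.split(None, 1)
        keyword = parts[0].lower()
        rest = parts[1] if len(parts) > 1 else ""
        if keyword == "@relation":
            relation = rest.strip()
        elif keyword == "@attribute":
            # Assumes an unquoted attribute name without spaces.
            name, type_spec = rest.split(None, 1)
            attributes.append((name, parse_type(type_spec)))
        elif keyword == "@data":
            in_data = True
    return relation, attributes, rows
```

Calling `read_arff` on the contents of a small file returns the relation name, the attribute declarations in order, and one list of values per instance.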

Some things might not work as expected, in particular:

  • strings with commas in them

But we're working on that. :)
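One way to handle commas inside strings is to split each data row with a small quote-aware tokenizer instead of a plain split on ",". The following Python sketch is an assumption about how this could be done, not the project's fix; the function name `split_arff_row` is hypothetical, and backslash escapes are simplified to "keep the next character literally".

```python
def split_arff_row(line):
    """Split one dense ARFF data row into fields, keeping quoted commas intact."""
    fields = []
    current = []
    quote = None  # the active quote character, or None when outside quotes
    i = 0
    while i < len(line):
        ch = line[i]
        if quote is not None:
            if ch == "\\" and i + 1 < len(line):
                # Simplified escape handling: keep the escaped character as-is.
                i += 1
                current.append(line[i])
            elif ch == quote:
                quote = None  # closing quote
            else:
                current.append(ch)  # commas in here are ordinary characters
        elif ch in ("'", '"'):
            quote = ch  # opening quote
        elif ch == ",":
            fields.append("".join(current))
            current = []
        elif ch.isspace():
            pass  # assumption: values containing whitespace are always quoted
        else:
            current.append(ch)
        i += 1
    fields.append("".join(current))
    return fields
```

For a row such as `1.5, 'Hello, world', yes`, the quoted string stays together as a single field while the other commas still separate values.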

Changes to previous version:

Forgot to include the Java sources.

Supported Operating Systems: Agnostic
Data Formats: None
Tags: Arff, Data Formats, Interoperability

Other available revisions

Version  Changelog                            Date
0.1.1    Forgot to include the Java sources.  March 12, 2009, 16:07:55
0.1      Initial Announcement on mloss.org.   December 5, 2008, 14:18:01

Comments

Peter Reutemann (on December 14, 2008, 01:55:25)

Since I had the impression that a few people in the data format discussion at the MLOSS NIPS 2008 workshop wanted to know in advance how many attributes and/or instances a dataset contains, I took some inspiration from shell scripts. Even though "#" starts a comment, a first line beginning with "#!" is treated specially and names the interpreter, e.g., "/bin/bash".

Theoretically, one could use the same approach in the ARFF format to provide additional information. Here's an example:

%!ATTRIBUTES=20,INSTANCES=2000
%
@relation blah

@attribute ...
...

The first line, if it starts with "%!", could contain a comma-separated list of metadata. This could, e.g., include the number of attributes and instances in the data that follows. It wouldn't be hard to add that to Weka when outputting ARFF files, since it doesn't break anything. Reading still works as expected, because the line is skipped as a comment.

Just my 2c...
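A reader that wants to use such a "%!" line could parse it roughly as follows. This is a minimal Python sketch of the proposal above, not something Weka or the dataformat code is known to implement; the function name `parse_meta_line` is hypothetical, and only the keys ATTRIBUTES and INSTANCES come from the example.

```python
def parse_meta_line(first_line):
    """Return the key/value pairs of a '%!' metadata line.

    Returns an empty dict for any other line, including ordinary '%' comments,
    so files without the proposed line are handled unchanged.
    """
    line = first_line.strip()
    if not line.startswith("%!"):
        return {}
    meta = {}
    for item in line[2:].split(","):
        key, sep, value = item.partition("=")
        if sep:  # ignore entries without an '=' sign
            meta[key.strip().upper()] = value.strip()
    return meta
```

Values are kept as strings here; a caller that needs the counts could convert them with `int()`, for example to preallocate storage before reading the data section.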

Mikio Braun (on December 16, 2008, 14:41:43)

Hi Peter,

hope you had a safe trip back!

I like your idea. Sounds perfectly reasonable to me. Eventually, we could also think about adding more "@"-commands. It should not be a problem as long as it's just one line. That would of course break WEKA.

-M

Peter Reutemann (on December 16, 2008, 22:29:51)

Hi Mikio

Not yet back, waiting for the bus to pick me up...

I think Mark will be open to suggestions, especially if the suggestions are sensible AND come as contributions, i.e., no extra work for him. ;-)

One of the worst shortcomings of the ARFF format is that one cannot specify a class attribute. For instance, instead of using "@attribute", one could use "@class" to specify the class attribute.

Cheers, Peter
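To illustrate the "@class" idea, a header parser could treat "@class" like "@attribute" and also remember its position. The Python sketch below is hypothetical: "@class" is only the suggestion from this comment and is not part of ARFF as Weka reads it, and the function name `parse_header` is made up for the example.

```python
def parse_header(lines):
    """Collect attribute declarations and the index of an '@class' declaration.

    Returns (attributes, class_index); class_index is None when the header
    contains no '@class' line. Assumes unquoted attribute names.
    """
    attributes = []  # list of (name, type_spec) pairs, in declaration order
    class_index = None
    for raw in lines:
        parts = raw.strip().split(None, 2)
        if len(parts) < 3:
            continue  # e.g. '@relation name', '@data', or blank lines
        keyword, name, type_spec = parts
        keyword = keyword.lower()
        if keyword in ("@attribute", "@class"):
            if keyword == "@class":
                # Hypothetical extension: mark this attribute as the class.
                class_index = len(attributes)
            attributes.append((name, type_spec))
    return attributes, class_index
```

Because the class attribute keeps its place in the attribute list, the data rows would not need to change; only the header carries the extra information.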

Rasoul (on March 11, 2009, 10:34:19)

Hi,

Is the source code of this project available?

Thanks, Rasoul

Mikio Braun (on March 12, 2009, 16:08:50)

Hi Rasoul,

sorry, I forgot to include the sources in the tar file. Try out the new version, 0.1.1.

Best,

Mikio
