Open Thoughts

Missing values

Posted by Cheng Soon Ong on February 2, 2010

We were recently working on a way to represent data efficiently, and came across the problem of missing values. For simple tabular formats where every entry has the same type (e.g. all real values), it is convenient to store the data as a 2-D array. We are thinking of a Python numpy array, but any solution should be language independent. Very often, however, datasets contain missing values, which are indicated by some special character, for example by '?' in Weka's ARFF format. Unfortunately, the character '?' is not a real number, so it stuffs up the array.

Does anyone have a suggestion on how to deal with this?

Note that I'm not talking about something like missing value imputation, but just the question of how to represent simple tabular data in computer memory. Of course, the question can be generalized such that some features may have different types from others.
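For the generalized case with mixed types, one option is to keep a separate "present" flag for each entry alongside the typed values, since integer columns have no spare value to act as a missing marker. The sketch below is only an illustration using the Python standard library: the helper `parse_column`, the fill values, and the sample columns are all hypothetical, and '?' is assumed as the missing marker.

```python
from array import array

MISSING = "?"  # assumed missing-value marker, as in ARFF


def parse_column(fields, typecode, convert, fill):
    """Parse one column of strings into a typed array plus a presence mask.

    Returns (values, present), where present[i] is 1 if fields[i] held a
    real value and 0 if it was the missing marker. Missing slots in
    `values` hold the placeholder `fill`, which must never be read as data.
    """
    values = array(typecode)
    present = bytearray()
    for field in fields:
        if field.strip() == MISSING:
            values.append(fill)
            present.append(0)
        else:
            values.append(convert(field))
            present.append(1)
    return values, present


# Hypothetical columns: one integer feature, one real-valued feature.
ages, ages_ok = parse_column(["34", "?", "27"], "i", int, 0)
heights, heights_ok = parse_column(["1.80", "1.65", "?"], "d", float, 0.0)

# Only entries flagged as present are used in computations.
known_ages = [v for v, ok in zip(ages, ages_ok) if ok]
```

The cost is one extra byte per entry in this sketch (a bit mask would be smaller), but every column keeps its native type.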

This seems like such a common problem that there must be hundreds of solutions out there...

Comments

Tim Triche (on February 2, 2010, 19:58:49)

Many people use NaN or (if you're a dinosaur) something like SAS's -9 convention. Obviously, the latter is not a very good idea if your data does, in fact, contain negative values. Anyhow, 0/0 is one popular choice that doesn't suck.
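As a rough sketch of the NaN suggestion, the snippet below maps a '?' field to NaN while parsing a row into a typed array of doubles. It uses only the Python standard library; the helper names `parse_row` and `mean_ignoring_missing` and the sample row are made up for illustration. The same idea should carry over to a numpy float array.

```python
import math
from array import array

MISSING = "?"  # assumed missing-value marker, as in ARFF


def parse_row(fields):
    """Convert string fields to doubles, storing NaN for missing entries."""
    return array(
        "d",
        (float("nan") if f.strip() == MISSING else float(f) for f in fields),
    )


def mean_ignoring_missing(values):
    """Average the entries that are not NaN; NaN if nothing is present."""
    present = [v for v in values if not math.isnan(v)]
    return sum(present) / len(present) if present else float("nan")


row = parse_row("5.1,?,1.4,0.2".split(","))

# NaN never compares equal to anything, itself included, so missing
# entries have to be detected with math.isnan rather than ==.
missing_positions = [i for i, v in enumerate(row) if math.isnan(v)]
row_mean = mean_ignoring_missing(row)
```

Note that this only works for floating-point columns, and that NaN propagates through ordinary arithmetic, so sums and means need to skip the missing entries explicitly.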

Alejandro Dubrovsky (on February 3, 2010, 08:25:10)

PostgreSQL's standard \N seems suitable too

Alejandro Dubrovsky (on February 3, 2010, 08:25:56)

That didn't come out properly; there's meant to be a backslash in front of the N. Let's try: \N

Soeren Sonnenburg (on February 5, 2010, 08:51:58)

NaN would be nice - does anyone know how this is represented/supported in hdf5?

Soeren Sonnenburg (on February 6, 2010, 08:27:51)

We did some testing here and NaNs are just working nicely with hdf5 so we will stick to them.
