Open Thoughts

October 2010 archive

The long hallway

October 14, 2010

I am currently working on a project with someone I've never met, and two others who are almost a thousand km away. The work is for our sister project, mldata.org. It tackles one of those things that people talk about over beer but nobody actually does anything about: how to make the experimental part of machine learning available and reproducible. Please check out our motivations and come visit our demo at NIPS.

Shameless plug, isn't it?

One thing we tried during this project that was totally new to me was actively collaborating with people who are physically distant. Johnathan Follett called this the long hallway, referring to the fact that many companies today have virtual offices. The experience is distinct from telecommuting, since nobody is "in the office". One thing we found really hard was that we are effectively limited to written communication. We tried VoIP calls, but the lag across the Atlantic and the poor quality of conference calling were really quite irritating. So our normal mode of operation is a weekly chat meeting (using Jabber) and a mailing list. One upside of the written chat meeting is that writing up the minutes afterwards is easier.

As you can imagine, written text is a poor substitute for face-to-face communication. Many things are really tough, such as trying to define a new concept. Hand-waving arguments do not work (since we don't have hands to wave online), and simple misunderstandings persist for a very long time. One example that kept our mailing list busy for weeks was the concept of a training/validation/test split. One of us assumed that there is one dataset, and that each training/validation/test set is just a subset of this whole dataset. Another assumed that there would be three different datasets. Everything was fine until we thought about how to implement "hidden labels" in challenges. If we have only one dataset, then this requires hiding part of the labels. If we have three datasets, it means hiding the labels of some of the datasets. You can also see in this example that a subtle concept of "label" is already sneaking in. What is a label? For someone working on simple supervised learning with vectorial data, it is just the relevant column in the matrix. But what if there are multiple possible dependent variables? How about imputing missing values? How do we describe the learning task? What is a solution? Needless to say, such conceptual discussions were very heated, and there have been times when we felt that a good definition is not possible.
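
To make the two readings of a split concrete, here is a small Python sketch. It is only an illustration of the disagreement, not how mldata.org represents anything: the toy data and every name in it (`split`, `hide_labels_by_index`, `as_three_datasets`, `hide_labels_by_dataset`) are made up for this post.

```python
# Hypothetical sketch: all names and data are invented for illustration
# and do not reflect the actual mldata.org data model.

# Reading 1: one dataset, where each split is just a list of row indices.
features = [[0.1, 1.0], [0.4, 0.8], [0.5, 0.3], [0.9, 0.2], [0.7, 0.6], [0.2, 0.9]]
labels = [0, 0, 1, 1, 1, 0]
split = {"train": [0, 1, 2], "validation": [3], "test": [4, 5]}


def hide_labels_by_index(labels, hidden_rows):
    """Return a copy of labels with the given rows masked out as None."""
    hidden = set(hidden_rows)
    return [None if i in hidden else y for i, y in enumerate(labels)]


# Reading 2: three separate datasets, each carrying its own labels.
def as_three_datasets(features, labels, split):
    """Materialise each split as a standalone (features, labels) pair."""
    return {
        name: ([features[i] for i in rows], [labels[i] for i in rows])
        for name, rows in split.items()
    }


def hide_labels_by_dataset(datasets, hidden_names):
    """Return a copy in which the named datasets have no labels at all."""
    return {
        name: (x, None if name in hidden_names else y)
        for name, (x, y) in datasets.items()
    }


# Hidden labels under reading 1: part of one label column is masked.
public_labels = hide_labels_by_index(labels, split["test"])

# Hidden labels under reading 2: a whole dataset ships without labels.
datasets = as_three_datasets(features, labels, split)
public_datasets = hide_labels_by_dataset(datasets, {"test"})
```

Under the first reading, "hidden labels" means a label column with holes in it; under the second, it means a dataset that simply has no label column. Both describe the same challenge, which is probably why the misunderstanding survived for so long.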

Back to the long hallway. I tend to associate a particular face and voice with written text. Since I have never met one of my collaborators, I seem to have "made up" a particular vision of him, complete with what I think he looks like from the low-res (and probably outdated) photo on his website, and his speech timbre and accent. I was highly disconcerted during our first conference call, several months into the project, when he spoke in a voice that totally didn't fit my mental image. I continue to be surprised by his voice each time we talk, since we don't have phone calls that often. I'm sure I'm going to be surprised by his physical appearance when I see him.