What is Disco?

Disco is an open-source implementation of the Map-Reduce framework for distributed computing. Like the original framework, Disco supports parallel computations over large data sets on an unreliable cluster of computers.

The Disco core is written in Erlang, a functional language designed for building robust, fault-tolerant distributed applications. Users of Disco typically write jobs in Python, which makes it possible to express even complex algorithms or data processing tasks in only tens of lines of code. This means that you can quickly write scripts to process massive amounts of data.

Disco was started at Nokia Research Center as a lightweight framework for rapid scripting of distributed data processing tasks. So far, Disco has been successfully used, for instance, in parsing and reformatting data, data clustering, probabilistic modelling, data mining, full-text indexing, and log analysis with hundreds of gigabytes of real-world data.

How to get started

Learn about Disco by reading the documentation. Once you are ready to give it a try, follow the setup instructions. You don't need a cluster to run it - any multi-core machine can benefit from Disco. Currently Linux is the only supported platform.

Do you need more than a single machine? No problem - you can run Disco on Amazon's Elastic Compute Cloud (EC2).

Need help with Disco? We can be reached on our IRC channel #discoproject at Freenode or on the Disco discussion group.

Get involved

Clone your own Disco repository at GitHub and join our mailing list and IRC channel. Even if you don't want to dive into Erlang, Python, or JavaScript code, you can help us by giving feedback!

If large-scale data analysis, distributed computing, or data visualization is your passion, and you'd be happy to develop Disco and related techniques full-time, see here for more information - we're hiring!

from disco.core import Disco, result_iterator

# Map: emit a (word, 1) pair for every word in an input line.
def fun_map(e, params):
    return [(w, 1) for w in e.split()]

# Reduce: sum the counts for each word and write the totals to the output.
def fun_reduce(iter, out, params):
    s = {}
    for w, f in iter:
        s[w] = s.get(w, 0) + int(f)
    for w, f in s.iteritems():
        out.add(w, f)

# Submit the job and wait for it to finish.
results = Disco("disco://localhost").new_job(
    name = "wordcount",
    input = ["http://discoproject.org/chekhov.txt"],
    map = fun_map,
    reduce = fun_reduce).wait()

# Print each word with its total count.
for word, frequency in result_iterator(results):
    print word, frequency

This is a fully working Disco script that computes word frequencies in a text corpus. Disco automatically distributes the script to a cluster so that it can use all available CPUs in parallel. For details, see the Disco tutorial.
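
To see what the map and reduce functions in the script compute without setting up a cluster, the sketch below runs the same logic in a single local process. It is an illustration only and does not use Disco. `LocalOut` and `run_locally` are hypothetical names introduced here, `LocalOut` mimics only the `add` call that the script uses, and the code assumes Python 3 (so `items()` replaces `iteritems()`). It also assumes one reduce call over all intermediate pairs, which may differ from how Disco schedules the work.

```python
def fun_map(e, params):
    # Same logic as the script: emit (word, 1) for each word in a line.
    return [(w, 1) for w in e.split()]


def fun_reduce(iter, out, params):
    # Same logic as the script, with items() instead of the
    # Python 2-only iteritems().
    s = {}
    for w, f in iter:
        s[w] = s.get(w, 0) + int(f)
    for w, f in s.items():
        out.add(w, f)


class LocalOut:
    """Hypothetical stand-in for the `out` object; only add() is mimicked."""

    def __init__(self):
        self.pairs = []

    def add(self, key, value):
        self.pairs.append((key, value))


def run_locally(lines, params=None):
    """Hypothetical helper: run map, then reduce, in this process."""
    # Map phase: apply fun_map to every input line and collect the pairs.
    mapped = []
    for line in lines:
        mapped.extend(fun_map(line, params))
    # Reduce phase: pass all intermediate pairs to a single fun_reduce call.
    out = LocalOut()
    fun_reduce(mapped, out, params)
    return dict(out.pairs)


if __name__ == "__main__":
    counts = run_locally(["the cat sat", "the dog sat"])
    for word, frequency in sorted(counts.items()):
        print(word, frequency)
```

A local dry run like this is handy for checking the word-splitting and counting logic on a few lines of text before submitting the real job.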
Last modified: Wed Dec 10 23:49:19 UTC 2008