mloss.org Disco | http://mloss.org | Updates and additions to Disco | en | Mon, 06 Oct 2008 11:14:48 -0000 | Disco 0.1 | http://mloss.org/software/view/140/
<html><p>Disco is an open-source implementation of the <a href="http://en.wikipedia.org/wiki/MapReduce">Map-Reduce framework</a> for distributed computing. Like the original framework, Disco supports parallel computations over large data sets on an unreliable cluster of computers. You don't need a cluster to use Disco -- a script is provided that automatically installs Disco on <a href="http://aws.amazon.com">Amazon's EC2 computing cloud</a>, where you get computing resources on demand. </p>
<p>The Disco core is written in Erlang, a functional language designed for building robust, fault-tolerant distributed applications. Users of Disco typically write jobs in Python, which makes it possible to express even complex algorithms or data processing tasks in often only tens of lines of code. This means that you can quickly write scripts to process massive amounts of data. </p>
<p>Disco was started at <a href="http://research.nokia.com">Nokia Research Center</a> as a lightweight framework for rapid scripting of distributed data processing tasks. So far, Disco has been used successfully for, among other things, parsing and reformatting data, data clustering, probabilistic modelling, data mining, full-text indexing, and log analysis, with hundreds of gigabytes of real-world data on hundreds of CPUs in parallel. </p>
<p>Many well-known machine learning and data mining methods map cleanly to the Map-Reduce framework (see <a href="http://www.cs.stanford.edu/people/ang//papers/nips06-mapreducemulticore.pdf">Map-Reduce for Machine Learning on Multicore</a> for examples).
</p>
<p>Disco includes example implementations of the following ML methods: </p>
<ul> <li> <a href="http://github.com/tuulos/disco/tree/master/examples/datamining/naive_bayes.py">Naive Bayes</a> </li> <li> <a href="http://github.com/tuulos/disco/tree/master/examples/datamining/kmeans.py">K-means</a> </li> <li> <a href="http://github.com/tuulos/disco/tree/master/examples/datamining/naive_linreg.py">Linear regression</a> </li> <li> <a href="http://github.com/tuulos/disco/tree/master/examples/datamining/perceptron.py">Perceptron</a> </li> <li> <a href="http://github.com/tuulos/disco/tree/master/examples/datamining/widrowhoff.py">Widrow-Hoff learning</a> </li> </ul>
<p>In addition to these examples, Disco is known to have been used for at least the following tasks: </p>
<ul> <li> Learning Hidden Markov Models </li> <li> Frequent itemset mining </li> <li> Full-text indexing </li> </ul></html>
Nokia Research Center | Mon, 06 Oct 2008 11:14:48 -0000 | http://mloss.org/software/rss/comments/140 | http://mloss.org/software/view/140/ | large scale, large scale learning, distributed, framework, data mining, nips2008, mapreduce