Sevenforge by Curtis Spencer

Meguro, a simple JavaScript Map/Reduce framework

Download

Documentation

Overview

Meguro is a Map/Reduce framework built on Tokyo Cabinet for data storage and SpiderMonkey (TraceMonkey, to be exact) for the map/reduce logic. It is inspired by the JavaScript map/reduce methods inside CouchDB. The core idea for Meguro came about because of a need at Kosmix to work easily with a reasonably large dataset made up primarily of JSON inside a Tokyo Cabinet database. The dataset wasn't quite large enough to warrant setting up a Hadoop cluster, but the single-threaded Ruby process we were initially using proved too slow. Parsing the dataset with a general-purpose C++ program was certainly an option, but it would have meant re-implementing core pieces such as iteration, threading, I/O, and sorting. Meguro handles that dirty work behind the scenes, so the user only has to implement the fun parts: the map and reduce functions.

Secondly, Meguro provides a basic gateway into the world of map/reduce. There are other simple map/reduce engines out there, but I think coupling one with JavaScript and a simple executable opens the door to getting more people thinking about their algorithms in map/reduce terms. Currently it is a non-distributed map/reduce: it only runs as many threads as you give it with the -t parameter, but I am looking at Hadoop Streaming as a way to take the same simple JS logic to multiple nodes.

Building Meguro

Meguro requires the following prerequisites: Tokyo Cabinet for storage and SpiderMonkey (TraceMonkey) for the JavaScript engine.

To build it, run the following in your meguro directory:

> ./configure
> make 
> sudo make install

Javascript API

The Meguro object has a few methods that you can call from within your JavaScript map/reduce functions. Learn more about the JavaScript API.
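
As a quick sketch, using only the two calls that appear in the example below, a Meguro job boils down to a map function that emits intermediate key/value pairs with Meguro.emit and a reduce function that writes final results with Meguro.save:

function map(key, value) {
  // Emit an intermediate key/value pair for the reducer to collect.
  Meguro.emit('some key', '1');
}

function reduce(key, values) {
  // Write a final key/value pair to the reducer output.
  Meguro.save(key, values.length);
}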

An Example

Why don't we do the classic map/reduce word-count example, with a slight bend toward the real-time web? Let's figure out the top hash tags mentioned on Twitter over some window of time. First, let's grab some tweets from the Twitter streaming API spritzer to give us some data to work with.

$ curl http://stream.twitter.com/1/statuses/sample.json -u twitter_screenname:password

You can also download a datafile that I curl'd earlier here.

The Twitter streaming API returns data line by line, with one JSON hash representing a tweet per line. JSON is something we can parse natively in JavaScript, so it works well with Meguro.

First, we will map through the tweets. When using Meguro's line-by-line iterator, the key will be null. In the following code, we parse the JSON, look at the text attribute, and try to find all the hash tags in it. Each time we see a hash tag, we emit a '1'.

function map(key, value) {
  // With the line-by-line iterator the key is null; the value is one tweet as JSON.
  var json = JSON.parse(value);
  var text = json.text;
  if (text) {
    // Split on whitespace and simple punctuation, then emit a '1' for each hash tag.
    var words = text.split(/[\s\.:?!]+/);
    for (var i = 0; i < words.length; i++) {
      var word = words[i];
      if (word.indexOf('#') === 0)
        Meguro.emit(word, '1');
    }
  }
}

Next, we will want to count up those emits in the reduce step to get the final reduced values. Use Meguro.save to save the key and the resulting value to the reducer output. The value will always be converted to a string.

function reduce(key, values) {
  // values is an array of every '1' emitted for this key, so its length is the count.
  Meguro.save(key, values.length);
}

Put those two functions in the same file, our.js, which we will use as the JavaScript input file to Meguro.

$ meguro -j our.js twitter_output.txt
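
If you want to use more than one thread, the -t parameter mentioned earlier controls the thread count. A run with four threads might look like this (the count of 4 is just an example):

$ meguro -t 4 -j our.js twitter_output.txt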

The resulting reduce.out file is a key/value store that holds our keys and their counts. You can use the meguro executable with the --list option to list the results in tab-delimited form. Pro tip: reduce.out is just a Tokyo Cabinet hash database, so you can use tchmgr to look at it. The following example sorts the hash tags by frequency.

$ meguro --list reduce.out | LC_ALL='utf8' sort -r -g -k 2 | less
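
If you'd rather poke at reduce.out directly with Tokyo Cabinet's own tools, something along these lines should dump the keys and values (the -pv flag asks tchmgr to print values alongside keys; check tchmgr's usage output on your install):

$ tchmgr list -pv reduce.out | less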

This is just one example of what you can do with Meguro. Please file issues or enhancement requests over on the GitHub page.

