[Pydra] map-reduce - experiments

Jakub Gustak jgustak at gmail.com
Fri May 29 18:51:00 UTC 2009


First week of coding (horror) is behind us. Since I tend to think in
code rather than on paper, I built a sequential map-reduce which runs
in plain Python.

I experimented with:
- the input/output data API,
- outputting intermediate results (by map) and fetching them (by
reduce) based on flat files,
- intermediate data partitioning for the reduce stage.
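As an illustration of what flat-file emitting plus partitioning could
look like (a sketch of my own with hypothetical names, not the code
from the experiment): intermediate pairs are hashed by key into one
flat file per reduce partition.

```python
import os
import zlib


def partition(key, num_partitions):
    # stable hash, so the same key always lands in the same partition
    # (unlike the built-in hash(), which is randomized per process)
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def write_intermediate(pairs, num_partitions, directory):
    # one flat file per reduce partition, one "key\tvalue" line per pair
    files = [open(os.path.join(directory, "part-%d" % i), "w")
             for i in range(num_partitions)]
    try:
        for key, value in pairs:
            files[partition(key, num_partitions)].write(
                "%s\t%s\n" % (key, value))
    finally:
        for f in files:
            f.close()
```

A reduce worker responsible for partition i would then read every
mapper's part-i file; because the partition function is deterministic,
all pairs for a given key end up in the same partition.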

I did not address:
- synchronization/locking issues when emitting intermediate results,
- input data partitioning/splitting for the map stage.

I tried to define the above with maximum flexibility, so that
potential users can write their own data "pushing" helpers tailored to
their needs.

The proposed input and output APIs for use in map and reduce functions
resemble standard Python objects as much as possible:
- the output object is a dict-like (write-only?) object (must provide
a __setitem__() method),
- the input object is an iterator (must provide __iter__() and next()
methods).
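A minimal sketch of what such objects might look like (the class names
are mine, purely illustrative):

```python
class PairOutput(object):
    """Dict-like output object: collects emitted (key, value) pairs,
    allowing the same key to be emitted more than once."""

    def __init__(self):
        self.pairs = []

    def __setitem__(self, key, value):
        self.pairs.append((key, value))


class IterInput(object):
    """Iterator-style input object wrapping any iterable."""

    def __init__(self, items):
        self._it = iter(items)

    def __iter__(self):
        return self

    def __next__(self):
        return next(self._it)

    next = __next__  # Python 2 spelling of the iterator protocol
```

Note that PairOutput deliberately does not behave like a real dict:
assigning to the same key twice keeps both pairs, which is what the
map stage needs.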

Below is an example of self-explanatory map and reduce functions
solving the problem of counting word occurrences:

def map_fun(input, output):
    """for every input word emit a (word, 1) pair"""

    for word in input:
        # emit (word, 1)
        output[word.strip()] = 1

def reduce_fun(input, output):
    """sum the occurrences of each word"""

    d = {}

    # WARNING: the same key can appear more than once, so accumulate;
    # each input item is a (word, values) pair, values being an
    # iterable of counts
    for word, v in input:
        d[word] = sum(v) + d.get(word, 0)

    for word, num in d.iteritems():
        # emit (word, num)
        output[word] = num
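For completeness, functions like these can be wired together by a
small sequential driver along the following lines (my own sketch, not
the code from the gist); it groups the values emitted by map into
per-key lists before handing them to reduce, which is why reduce sees
iterables of counts:

```python
def run_sequential(map_fun, reduce_fun, records):
    intermediate = []

    class MapOutput(object):
        def __setitem__(self, key, value):
            intermediate.append((key, value))

    # map stage: every emitted (key, value) pair is collected
    map_fun(iter(records), MapOutput())

    # stand-in for the partitioning step: group values per key
    groups = {}
    for key, value in intermediate:
        groups.setdefault(key, []).append(value)

    result = {}

    class ReduceOutput(object):
        def __setitem__(self, key, value):
            result[key] = value

    # reduce stage: each input item is a (key, values) pair
    reduce_fun(iter(groups.items()), ReduceOutput())
    return result


def wc_map(input, output):
    for word in input:
        output[word.strip()] = 1


def wc_reduce(input, output):
    for word, values in input:
        output[word] = sum(values)
```

Calling run_sequential(wc_map, wc_reduce, ["the ", "cat", "the"])
should yield a count of 2 for "the" and 1 for "cat".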

The complete experimental example can be found on GitHub [gist:120105]:

I would welcome any suggestions.
Next week I would like to build/port a similar map-reduce task to run
under Pydra's control.

[gist:120105] http://gist.github.com/120105
