[Pydra] map-reduce - parallel map stage

Jakub Gustak jgustak at gmail.com
Fri Jun 12 07:06:20 UTC 2009

A working version of MapReduceTask with a parallel map stage is
available on my [github]_. From here, parallelizing the reduce stage
is pretty straightforward. This hopefully ends the heavy experimenting.

I am more than open to discussion and suggestions. If required, I can
write up a summary and/or a description of the implementation.
But I hope the code (despite having few comments) and the commit
messages are readable.

One note regarding the scheduling of reduce tasks:
To produce coherent output, each key must be given to exactly one
ReduceTask, so it is wise to use a simple partition function
that decides on this distribution:

    def partition(self, key):
        return hash(str(key)) % self.reducers

This requires the number of reduce tasks to be known in advance.
For a similar reason, a ReduceTask shouldn't be started before all keys
belonging to it have been provided, which in most cases means we have
to wait for all MapTasks to finish.
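
To illustrate both points, here is a minimal standalone sketch. It is not
the code from the repository: the names NUM_REDUCERS, map_words and run
are made up for the example, the word-count job is just a placeholder,
and everything runs sequentially in one process. It also uses zlib.crc32
instead of hash(), on the assumption that map tasks may run in separate
processes under Python 3, where hash() of a string is randomized per
process by default.

```python
import zlib
from collections import defaultdict

# Hypothetical constant: the reducer count has to be fixed before any
# map output is partitioned, otherwise keys could change buckets.
NUM_REDUCERS = 3


def partition(key, reducers=NUM_REDUCERS):
    # Same idea as hash(str(key)) % reducers, but crc32 gives the same
    # value in every process, so each key always maps to one reducer.
    return zlib.crc32(str(key).encode("utf-8")) % reducers


def map_words(line):
    # Placeholder map function: emit (word, 1) for every word.
    return [(word, 1) for word in line.split()]


def run(lines, reducers=NUM_REDUCERS):
    # One bucket per reducer, each mapping key -> list of values.
    buckets = [defaultdict(list) for _ in range(reducers)]

    # Map stage (sequential here; this is the part that would be parallel).
    for line in lines:
        for key, value in map_words(line):
            buckets[partition(key, reducers)][key].append(value)

    # Reduce stage: only reached once every map output has been routed,
    # so each bucket already holds all values for its keys.
    results = {}
    for bucket in buckets:
        for key, values in bucket.items():
            results[key] = sum(values)
    return results
```

Calling run(["a b a", "b c"]) counts each word once per occurrence. Since
a key lands in exactly one bucket, no two reducers ever produce output
for the same key.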

[github] http://github.com/jlg/pydra-map-reduce/tree/master

