[Pydra] map-reduce - parallel map stage
jgustak at gmail.com
Fri Jun 12 07:06:20 UTC 2009
A working version of MapReduceTask with a parallel map stage is
available on my [github]_. At this point, parallelizing the reduce
stage is pretty straightforward. This hopefully ends heavy experimenting
I am more than open to any discussion and suggestions. If required,
I can write up a summary and/or a description of the implementation.
But I hope the code (despite few comments) and commit messages are
One note regarding scheduling of reduce tasks:
To produce coherent output, each key must go to exactly one
ReduceTask, so it is wise to use a simple partition function
which decides on this distribution:
    def partition(self, key):
        return hash(str(key)) % self.reducers
This requires the number of reduce tasks to be known in advance.
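
To illustrate the routing, here is a minimal standalone sketch. It is
not the code from the repository: the names route and reducers (as a
plain argument) are made up for the example, and zlib.crc32 is swapped
in for hash() as an assumption, because in current Python 3 the
built-in hash() of a string is randomized per process by default and
so would not agree between separate workers.

```python
import zlib
from collections import defaultdict


def partition(key, reducers):
    """Return the index of the reducer responsible for `key`.

    `reducers` is the number of reduce tasks and must be fixed
    before any key is routed.
    """
    # crc32 gives the same value in every process for the same bytes.
    return zlib.crc32(str(key).encode("utf-8")) % reducers


def route(pairs, reducers):
    """Group (key, value) pairs into per-reducer buckets (hypothetical helper)."""
    buckets = defaultdict(list)
    for key, value in pairs:
        buckets[partition(key, reducers)].append((key, value))
    return buckets
```

Every occurrence of a given key lands in the same bucket, whichever
map task emitted it, as long as the reducer count does not change.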
For a similar reason, a ReduceTask shouldn't be called before all keys
belonging to it have been provided, which in most cases means we have
to wait for all MapTasks to finish.
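
The ordering constraint can be sketched with the standard library
alone. This is only an illustration under assumptions, not Pydra's
API: map_words, reduce_counts and run are hypothetical names, a word
count stands in for the real job, and a thread pool stands in for the
worker nodes.

```python
import zlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_words(line):
    """Map stage: emit (word, 1) for every word in one input line."""
    return [(word, 1) for word in line.split()]


def reduce_counts(pairs):
    """Reduce stage: sum the values for each key in one bucket."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def run(lines, reducers=2):
    with ThreadPoolExecutor() as pool:
        # list() drains the iterator, so this line acts as the barrier:
        # nothing below runs until every map call has finished.
        mapped = list(pool.map(map_words, lines))

        # Route each key to exactly one bucket (stable hash, assumed).
        buckets = [[] for _ in range(reducers)]
        for pairs in mapped:
            for key, value in pairs:
                index = zlib.crc32(key.encode("utf-8")) % reducers
                buckets[index].append((key, value))

        # Only now is it safe to start the reduce calls in parallel.
        partial = list(pool.map(reduce_counts, buckets))

    # Buckets hold disjoint key sets, so a plain merge is enough.
    result = {}
    for part in partial:
        result.update(part)
    return result
```

Because the buckets have disjoint keys, the partial results can be
merged without any further combining step.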