[Pydra] map-reduce - parallel reduce stage (and more)
jgustak at gmail.com
Fri Jul 3 13:05:51 UTC 2009
> It looks good. A few things...
> == self.available_workers ==
> The concept of pre-assigned workers will go away with the new
> scheduler. The scheduler will either queue requests or pull requests
> from the tasks. This doesn't need to be updated until the scheduler is
Looking forward to seeing the new scheduler in action. As was suggested
earlier, the change shouldn't be very painful, so I don't expect big problems.
> == Task as Mapper or Reducer
> My objections to not allowing Task to be used as a Mapper or Reducer
> still stand. It's not going to hold
> up merging in the code though because changing this doesn't alter how
> the rest of the code works.
Now mapper and reducer are expected to be regular tasks. Right now the
only requirement is that they accept two kwargs:
input and output, which are an iterator and an AppendableDict respectively.
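To make the contract concrete, here is a minimal sketch of what such
mapper and reducer tasks could look like. AppendableDict below is a
hypothetical stand-in (a dict collecting a list of values per key; the
real Pydra class may differ), and the word_count_* names are purely
illustrative:

```python
from collections import defaultdict


class AppendableDict(defaultdict):
    """Hypothetical stand-in for Pydra's AppendableDict:
    collects multiple values under the same key."""

    def __init__(self):
        super().__init__(list)

    def append(self, key, value):
        self[key].append(value)


def word_count_mapper(input, output, **kwargs):
    # input: an iterator over records; output: an AppendableDict
    for line in input:
        for word in line.split():
            output.append(word, 1)


def word_count_reducer(input, output, **kwargs):
    # input: an iterator of (key, values) pairs; output: an AppendableDict
    for key, values in input:
        output.append(key, sum(values))
```

Since both are plain callables with the (input, output) kwargs, they can
be chained by feeding the mapper's output items into the reducer.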
> == intermediate file ==
> Are you planning on replacing this with the input/output helper
> interface, once it is built?
Yes. I think this is the right moment to figure out what exactly we
would like them to look like.
I was thinking about splitting IntermediateResultFiles into two parts
(e.g. IntermediateOutputFiles (for Mapper) and
IntermediateInputFiles (for Reducer)). The only part shared by those two
functionalities is the directory path in which the files are stored.
InputFiles and OutputFiles can be built in a similar manner.
In that case we would need 4 "helpers" for a map-reduce computation.
Two for a map stage: Input, IntermediateOutput
Two for a reduce stage: IntermediateInput, Output
But IntermediateOutput and IntermediateInput should always be built in
pairs, so that the requirement that a particular key is always processed
by the same reduce task is met.
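A rough sketch of the intermediate pair, to show why the two halves must
be built together: both sides have to agree on the directory path and on
the partitioning function, so that every value for a given key ends up
in the file read by the same reduce task. The file naming, the pickle
format, and the md5-based partitioner are all assumptions of mine, not
anything decided for Pydra:

```python
import hashlib
import os
import pickle


def partition(key, num_reducers):
    # Deterministic partitioner shared by BOTH intermediate helpers,
    # so a given key always lands with the same reduce task.
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % num_reducers


class IntermediateOutputFiles:
    """Mapper-side half: appends (key, value) pairs to one file per partition."""

    def __init__(self, dir_path, num_reducers):
        self.dir = dir_path
        self.num_reducers = num_reducers
        os.makedirs(dir_path, exist_ok=True)

    def append(self, key, value):
        part = partition(key, self.num_reducers)
        # Concatenated pickles in append mode; loadable one by one.
        with open(os.path.join(self.dir, "part-%d" % part), "ab") as f:
            pickle.dump((key, value), f)


class IntermediateInputFiles:
    """Reducer-side half: iterates the (key, value) pairs of one partition."""

    def __init__(self, dir_path, part):
        self.path = os.path.join(dir_path, "part-%d" % part)

    def __iter__(self):
        if not os.path.exists(self.path):
            return
        with open(self.path, "rb") as f:
            while True:
                try:
                    yield pickle.load(f)
                except EOFError:
                    break
```

The shared directory path is the only state the two classes have in
common, which matches the split described above; everything else (the
partitioner, the on-disk format) just has to be identical on both sides.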