[Pydra] map-reduce - parallel reduce stage (and more)

Jakub Gustak jgustak at gmail.com
Fri Jul 3 13:05:51 UTC 2009


> It looks good.  A few things...
>
> == self.available_workers ==
> The concept of pre-assigned workers will go away with the new
> scheduler.  The scheduler will either queue requests or pull requests
> from the tasks.  This doesn't need to be updated until the scheduler is
> finished.

Looking forward to seeing the new scheduler in action. As was suggested
earlier, the change shouldn't be very painful, so I don't expect big
problems.

> == Task as Mapper or Reducer
> My objections to not allowing Task to be used as a Mapper or Reducer
> still stand.  It's not going to hold
> up merging in the code though because changing this doesn't alter how
> the rest of the code works.

Now the mapper and reducer are expected to be regular tasks. Right now
the only requirement is that they accept two kwargs, input and output,
which are an iterator and an AppendableDict respectively.

http://github.com/jlg/pydra-map-reduce/commit/4437dc207931f71b831952c236423daefe4f26a0
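For illustration, a minimal sketch of that contract. The names below
(word_count_mapper, word_count_reducer, and the stand-in AppendableDict)
are only illustrative, not the actual Pydra classes:

```python
# Stand-in for Pydra's AppendableDict: assigning to a key appends the
# value to a list instead of overwriting it. (dict.setdefault inserts
# at the C level, so it does not recurse into our __setitem__.)
class AppendableDict(dict):
    def __setitem__(self, key, value):
        self.setdefault(key, []).append(value)

def word_count_mapper(input, output):
    """A mapper is just a regular task taking `input` (an iterator)
    and `output` (an AppendableDict) as kwargs."""
    for line in input:
        for word in line.split():
            output[word] = 1          # appends 1 under each word

def word_count_reducer(input, output):
    """A reducer has the same signature; here `input` iterates over
    (key, values) pairs produced by the map stage."""
    for word, counts in input:
        output[word] = sum(counts)

# Wiring the two stages together by hand:
intermediate = AppendableDict()
word_count_mapper(input=iter(["a b a"]), output=intermediate)

final = AppendableDict()
word_count_reducer(input=iter(intermediate.items()), output=final)
```

The point is only that both stages share one plain calling convention,
so any regular task with that signature can act as a mapper or reducer.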

> == intermediate file ==
> Are you planning on replacing this with the input/output helper
> interface, once it is built?

Yes. I think this is the right moment to figure out exactly how we
would like them to look.
I was thinking about splitting IntermediateResultFiles into two parts,
e.g. IntermediateOutputFiles (for the Mapper) and
IntermediateInputFiles (for the Reducer). The only part shared between
the two is the directory path in which the files are stored.

InputFiles and OutputFiles can be built in a similar manner.

In that case we would need 4 "helpers" for a map-reduce computation.
Two for a map stage: Input, IntermediateOutput
Two for a reduce stage: IntermediateInput, Output

But IntermediateOutput and IntermediateInput should always be built in
pairs, so that the requirement of a particular key always being
processed by the same reduce task is met.
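To make the pairing requirement concrete, here is a hypothetical sketch
(class and function names are mine, not the proposed interface): both
helpers must agree on one key-to-partition mapping, e.g. a stable hash
modulo the number of reduce tasks, so every occurrence of a key lands
in the bucket read by exactly one reducer.

```python
import zlib

def partition(key, num_reducers):
    # A stable hash (unlike Python's per-process randomized hash()),
    # so mapper and reducer sides always agree on the routing.
    return zlib.crc32(key.encode()) % num_reducers

class IntermediateOutput:
    """Mapper side: routes each (key, value) into one bucket."""
    def __init__(self, dir_path, num_reducers):
        self.dir_path = dir_path              # the shared directory path
        self.num_reducers = num_reducers
        self.buckets = {i: [] for i in range(num_reducers)}

    def emit(self, key, value):
        self.buckets[partition(key, self.num_reducers)].append((key, value))

class IntermediateInput:
    """Reducer side: reads exactly one bucket, so every occurrence of
    a given key is processed by the same reduce task."""
    def __init__(self, dir_path, reducer_id, buckets):
        self.dir_path = dir_path              # same shared directory path
        self.records = buckets[reducer_id]

    def __iter__(self):
        return iter(self.records)
```

Building the two independently (with different hash functions or
reducer counts) would silently scatter one key across several reducers,
which is exactly why they have to come as a pair.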

Tea time,
Jakub
