[Pydra] map-reduce - parallel reduce stage (and more)
peter at osuosl.org
Fri Jul 3 15:03:11 UTC 2009
Jakub Gustak wrote:
>> It looks good. A few things...
>> == self.available_workers ==
>> The concept of pre-assigned workers will go away with the new
>> scheduler. The scheduler will either queue requests or pull requests
>> from the tasks. This doesn't need to be updated until the scheduler is
> Looking forward to see a new scheduler in action. It was suggested
> earlier, the change shouldn't be very painful. So I don't expect big
>> == Task as Mapper or Reducer
>> My objections to not allowing Task to be used as a Mapper or Reducer
>> still stand. It's not going to hold
>> up merging in the code though because changing this doesn't alter how
>> the rest of the code works.
> Now mapper and reducer are expected to be regular tasks. Right now the
> only requirement is, they accept two kwargs:
> input, output. Which are iterator and AppendableDict respectively.
Great! I'll hopefully get a chance to merge it in this weekend. I've
been swamped with other work this past week.
>> == intermediate file ==
>> Are you planning on replacing this with the input/output helper
>> interface, once it is built?
> Yes. I think it is the right moment to figure out how exactly we would
> like them to look like.
> I was thinking about splitting IntermediateResultFiles to two parts
> (e.g. IntermediateOutputFiles (for Mapper) and
> IntermediateInputFiles). The only shared part for those two
> functionalities is a directory path in which files are stored.
> In similar manner InputFiles and OutputFiles can be built.
> In such case we would have 4 "helpers" to needed for a map-reduce computation.
> Two for a map stage: Input, IntermediateOutput
> Two for a reduce stage: IntermediateInput, Output
> But IntermediateOutput and IntermediateInput always should be build in
> pairs, so the requirement of a particular key being always processed
> by the same reduce task is met.
I have a rough idea of how I think it should work, but some details
still need to be filled in.
First we need to settle on terms. I like something like "Datasource"
better than "Helper" as it is more descriptive. "Datasource" implies
input though, and we need something that is input AND output.
The Datasource should be broken into two components:
* Backend - contains logic for connecting and reading/writing. This
would be specific to Filesystem, Memory, memcached, couchdb, etc.
* Slicer/Partitioner - translates between workunits and results into
The reasoning for this is many of the storage systems are very similar.
They generally fall into three categories:
* Key Value Systems - memcached, couchdb
* File systems (multiple files) - simplefilesystem, networkshare,
* Databases - MySql, Django ORM
Slicer/Partitioners might need to be specific to the category of storage
system, but maybe we can figure out how to abstract it better. There is
likely to be overlap such as a "filesystem with multiple files each with
key/value pairs inside". Those types of Datasources will be the most
complicated because the structure of data can be custom to each user.
There is potential to make this very flexible through re-use of
components. This is important because it must be easy to extend or
create custom inputs and outputs. The Datasource api is a translation
service between whatever crazy things users will feed into pydra, and a
common formats of workunits and intermediate files.
More information about the Pydra