[Pydra] map-reduce merge + auto-discovery
jgustak at gmail.com
Thu Aug 13 16:35:10 UTC 2009
>>>>> * is IntermediateResults using the datasource?
>>>> Not yet, I just wanted to confirm if this is what we want.
>>>> But to fully implement IntermediateResults we also need some output (for
>>>> the map stage) which should match the input (for the reduce stage). This
>>>> is the task for the upcoming week, I suppose.
>>> Right. There should really only be a single datasource configured for
>> What I thought so far (at least for map-reduce):
>> im = IntermediateResults()
>> im.map_output = FilePickleJoiner
>> im.reduce_input = FileUnpickleSlicer
>> Since map_output actually has to join all the data into some file, and
>> reduce_input slices it for the reduce task.
>> The IntermediateResults itself will handle only the partitioning of the
>> data. I've hacked together something similar, but it is a huge mess right now.
> How is the partitioning different from what the Joiner/Slicer does? It
> seems like there may be overlap here.
For the reduce stage we want to provide all intermediate results which
have the same key to one reduce-task. This is what I call
partitioning, and this is what IntermediateResults does. It decides
which keys will go to which reduce-task. This also means writing the
data in a way that lets one say which reduce-task (i.e. partition) it
belongs to.
IntermediateResults generates a partition-key for every output datum
(every dump() call) and provides it to the Joiner.
The Joiner stores the data in one way or another.
When IntermediateResults provides the same key to the Slicer, it should
get back the same data that the Joiner dumped for that reduce-task.
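The flow above could be sketched roughly like this (hypothetical names and a
toy in-memory Joiner, not Pydra's actual API): the partition-key is derived
from the record key, so every value with the same key lands in the same
reduce partition.

```python
# Hypothetical sketch of partition-key generation; not Pydra's real API.
class IntermediateResults:
    def __init__(self, num_reduce_tasks):
        self.num_reduce_tasks = num_reduce_tasks

    def partition_for(self, key):
        # Records with the same key always map to the same partition,
        # so one reduce-task sees all values for that key.
        return hash(key) % self.num_reduce_tasks

    def dump(self, key, value, joiner):
        # Hand the record to the Joiner under the partition it belongs to.
        joiner.store(self.partition_for(key), key, value)


class DictJoiner:
    """Toy in-memory Joiner; a real one would write each partition to a file."""
    def __init__(self):
        self.partitions = {}

    def store(self, partition, key, value):
        self.partitions.setdefault(partition, []).append((key, value))
```

Whatever the real storage looks like, the invariant is the one described
above: asking for a given partition-key later must yield exactly the records
dumped under it.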
>> But here, the issue is that reduce always gets data for the same key. So the
>> FileUnpickleSlicer cannot be exactly what we have known as a Slicer so far.
>> It has to provide an iterator for reduce which will slice the work-unit.
>> Therefore it is something we called a Subslicer.
> Is it an issue with how the data is stored within the file? I could see
> this being hard to format/parse.
The data input for a reduce-task is actually a set of key-value pairs.
This is what a Subslicer is for. It provides an iterator for a
reduce-task which reads the next element only when it is needed (within
the reduce-task). This allows us not to read all the data into memory
at once.
A Slicer, by contrast, slices a set of data into workunits, and only
one workunit is provided as the input for a task.
It is possible to use Slicers (with send_as_input=True) as Subslicers.
A Subslicer is simpler. It doesn't need the separation between key
generation (__iter__) and loading the data (load) based on a key, and
it doesn't require random access to an element. This means it cannot be
used as a Slicer.
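The interface difference might look roughly like this (hypothetical classes
for illustration; the real Pydra interfaces may differ): a Slicer separates
key enumeration from random-access loading, while a Subslicer is nothing but
a forward iterator over key-value pairs.

```python
# Hypothetical interfaces illustrating the Slicer/Subslicer distinction.
class Slicer:
    """Enumerates workunit keys and can load any workunit on demand."""
    def __init__(self, workunits):
        self._workunits = dict(workunits)

    def __iter__(self):
        # Key generation: yields the keys identifying each workunit.
        return iter(self._workunits)

    def load(self, key):
        # Random access: load the workunit for an arbitrary key.
        return self._workunits[key]


class Subslicer:
    """Only iterates key-value pairs in order; no load(), no random access."""
    def __init__(self, pairs):
        self._pairs = pairs  # any iterable, e.g. lazily read from a file

    def __iter__(self):
        for key, value in self._pairs:
            yield key, value
```

Since the Subslicer only exposes forward iteration, a Slicer (which needs
load() on arbitrary keys) cannot be built from it, which is the asymmetry
described above.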
If we wanted to make a Slicer based on pickle.dump()/load() it would be
more complicated than a Subslicer (the random-access part). And I am not
sure whether we need an UnpickleSlicer at all for data input.