[Pydra] Prototype solution to ticket #18, etc.

Peter Krenesky peter at osuosl.org
Tue Jun 30 15:32:51 UTC 2009


Hi Yin,

I think this is starting to take shape.  There are still issues with this
implementation.  To better discuss this I created a page on the wiki to
document what we're doing.  I'd like it to list any benefits or issues
so we can better decide what should be done.

Here is the main issue you need to solve:
    Completed workunits that generate more workunits:
   
http://pydra-project.osuosl.org/wiki/cluster_scheduler#WorkUnitsthatgeneratemoreworkunits


I think that the best solution is actually a cross between pushing and
pulling requests.  There are benefits to both methods, and I think we can
create a scheduler that combines the important features of each.  I'm
in the process of writing that up; more on it later.

-Peter



Yin QIU wrote:
> Hi,
>
> I think I've finished a prototype solution to ticket #18 (implement a
> basic worker allocation) and have pushed it to my previously mentioned
> repo. It introduces many changes. Though it's not 100% working yet, I
> believe it will be in the near future.
>
> Following our previous discussion, the mechanism is simple. Here is
> a sketch of it again:
>
> * The master maintains a long-term queue (LTQ) and a short-term queue
> (STQ). The former stores the IDs of tasks that are not yet
> running (similar to Master._queue in the original code), and the latter
> stores the IDs of running tasks. Both queues are actually priority
> queues, ordered by a score computed from several factors.
> Currently the score is just the priority value of a task. The ordering
> of the LTQ and STQ is updated periodically. (A rough sketch of these
> queues follows this list.)
> * When a worker becomes available, a task from either the LTQ or the
> STQ will be executed. In my implementation, I prefer a task in the LTQ
> to one in the STQ, because this increases throughput.
>   - If a task is popped from the LTQ, it is immediately moved to the
> STQ and delegated to run on a worker.
>   - If the LTQ is empty, we find the first task in the STQ that has
> pending worker requests and serve one of its requests. That is, we run
> a work unit of that task on a worker.
> * A task is started without knowledge of available workers. When it
> needs a worker to complete a work unit, it issues a worker request to
> the master.
> * As in the original design, a worker running a ParallelTask is said
> to be a main worker. A main worker is designed to be able to execute a
> work unit itself, so that a ParallelTask can complete even with only
> one worker. To adapt to my changes, a main worker is considered a
> special worker resource. This means two things: 1) utilizing a main
> worker to run a work unit requires explicitly sending a request to the
> master; and 2) when a main worker finishes its work unit (not the
> ParallelTask itself), it has to inform the master. My concern is that
> this may cause some side effects, e.g., when stopping tasks. We'll see
> in the future.
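>
> For illustration, here is a minimal sketch of the two-queue idea. The
> names used here (TwoQueueScheduler, run_task, run_work_unit, task.id,
> task.priority) are hypothetical placeholders, not the actual Pydra
> classes or attributes:
>
>     import heapq
>     from itertools import count
>
>     class TwoQueueScheduler(object):
>         """Illustrative sketch of the LTQ/STQ scheduling idea only."""
>
>         def __init__(self):
>             self._counter = count()  # tie-breaker so heapq never compares tasks
>             self.ltq = []            # (score, n, task): queued, not yet running
>             self.stq = []            # (score, n, task): running tasks
>             self.requests = {}       # task id -> number of pending worker requests
>
>         def _score(self, task):
>             # Currently just the task priority, negated so that higher
>             # priority pops first from the min-heap.
>             return -task.priority
>
>         def enqueue(self, task):
>             heapq.heappush(self.ltq, (self._score(task), next(self._counter), task))
>
>         def request_worker(self, task_id):
>             # Called when a running task needs a worker for a work unit.
>             self.requests[task_id] = self.requests.get(task_id, 0) + 1
>
>         def worker_available(self, worker):
>             # Prefer starting a queued task (LTQ) over feeding a running
>             # one (STQ); this tends to increase throughput.
>             if self.ltq:
>                 entry = heapq.heappop(self.ltq)
>                 heapq.heappush(self.stq, entry)
>                 worker.run_task(entry[2])
>                 return
>             # Otherwise serve the first running task with pending requests.
>             for score, n, task in sorted(self.stq):
>                 if self.requests.get(task.id, 0) > 0:
>                     self.requests[task.id] -= 1
>                     worker.run_work_unit(task)
>                     return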
>
>
> I have tested the current code with TestTask and it worked fine. I
> also tried to run it with TestParallelTask, but that is blocked by an
> issue. We know that when a ParallelTask is finished, that is, when all
> its work units are completed, the Worker.work_complete() hook is
> invoked on the main worker. This hook does the right thing when a
> worker runs a subtask (i.e., a work unit): it makes a remote call to
> Master.send_results, which subsequently calls the main worker's
> receive_results(). But if it is the root task that finished, the
> behavior of this hook is incorrect. So, does Worker.work_complete()
> need some modification to cover this?
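>
> For what it's worth, one possible shape of such a modification is
> sketched below. Master.send_results and receive_results() are mentioned
> above; everything else (is_root_task, task_complete(), task_id,
> workunit_key, and the argument lists) is illustrative and may not match
> the real code:
>
>     def work_complete(self, results):
>         # Sketch only: distinguish a finished work unit from the finished
>         # root task instead of always routing results to a main worker.
>         if self.is_root_task:
>             # The whole ParallelTask is done: report completion to the
>             # master and free this main worker.
>             return self.master.task_complete(self.task_id, results)
>         # A work unit (subtask) finished: send results to the master,
>         # which forwards them to the main worker's receive_results().
>         return self.master.send_results(self.task_id, self.workunit_key, results)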
>
>
> My prototype does not take the following into consideration yet:
>
> 1) ParallelTask._work_unit_complete() should return a value indicating
> whether or not a subsequent work request will follow. We retain the
> worker only if the return value indicates there is additional work.
> The problem, as we discussed, is that this may be inefficient. (A rough
> sketch follows this list.)
> 2) Batching work units to a worker (i.e., optimized slicing).
> 3) Nested task testing.
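>
> To make 1) concrete, here is a minimal sketch of the idea.
> ParallelTaskSketch, get_work_unit(), master.assign() and
> master.release_worker() are illustrative names, not the real Pydra API:
>
>     class ParallelTaskSketch(object):
>         # Illustrative only; not the actual ParallelTask class.
>         def __init__(self, work_units):
>             self._pending = list(work_units)
>             self._results = []
>
>         def get_work_unit(self):
>             return self._pending.pop(0)
>
>         def _work_unit_complete(self, results):
>             self._results.append(results)
>             # Return whether another work request will follow, so the
>             # caller can keep the worker instead of releasing it and
>             # immediately requesting another for the next work unit.
>             return len(self._pending) > 0
>
>     # Master side (sketch): keep the worker only if more work is coming.
>     def on_work_unit_complete(master, task, worker, results):
>         if task._work_unit_complete(results):
>             master.assign(worker, task.get_work_unit())
>         else:
>             master.release_worker(worker)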
>
> My plan is to work on 1) and 2) in the following days.
>
> I hope I've made myself clear. Any comments are welcome!
>
>
>   


