[Pydra] Master-Node-Worker relationship refactor
peter at osuosl.org
Sun Aug 23 06:19:03 UTC 2009
The branch has been updated with a more functional version.
Simple tasks can now be run, and workers are cleaned up correctly. I have
not yet tested with parallelTasks; I expect them to be broken, but most
of the work should be done. I also have not implemented all of the worker
shutdown code for errors and other cases besides a successful task.
Peter Krenesky wrote:
> I have a semi-functional version checked into the node_refactor branch.
> It's been a tedious process moving things around. It's been going slower
> than I hoped but I've made a lot of progress tonight.
> This code has the majority of the refactor but is still missing a few
> things for managing the workers. It can run a simple task once.
> * does not kill workers after a task completes
> * does not kill workers after release has been called
> * will only run a task once due to a deferred error
> Peter Krenesky wrote:
>> Hi all,
>> I've started refactoring Master, Node, and Worker to change the way in
>> which they relate to each other. When this refactor is complete Master
>> will only communicate with Nodes. Node will be the only component to
>> interact with Workers. Workers will be spawned per TaskInstance.
>> == WHY? ==
>> - workers need to be chrooted (sandboxed) per TaskInstance to ensure
>> no task can affect other users. Even importing a task file to read task
>> name and description puts the cluster at risk.
>> - Some libraries, django especially, can only be configured once per
>> runtime. This means changing datasources is not possible under the
>> current system.
>> - less network overhead from TCP connections.
>> - simpler networking logic.
>> == How? ==
>> - remove WorkerConnectionManager module from Master
>> - change add_node() so that instead of adding workers to the
>> checker, WORKER_CONNECTED signals are emitted with a special proxy object
>> that mimics a WorkerAvatar but is really the remote from the Node. This
>> allows all other logic in Master to remain the same.
>> - change node disconnection logic to include disconnecting workers
>> as well
>> - Add WorkerConnectionManager module to Node; Master's version of this can
>> be reused.
>> - Add mechanism for tracking running workers
>> - Add task_run that manages passing work to workers, and starting
>> new workers.
>> - Add callback system to task_run to handle asynchronous nature of
>> waiting for a worker to start before passing on a task_run
>> - Add remotes that proxy all other functions in
>> worker_task_controls to worker avatars
>> - Add remotes that proxy master functions to MasterAvatar.
>> - Modify WorkerConnectionManager to connect locally only and use
>> Node key for auth.
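
The task_run and callback items in the list above could look roughly like the sketch below. This is only an illustration of the idea, not the code in the node_refactor branch: the class names, method names, and the plain-callback style (instead of Twisted deferreds) are all assumptions made to keep the example self-contained.

```python
# Hypothetical sketch: Node, WorkerProxy, spawn_worker and worker_connected
# are illustrative names, not taken from the Pydra source.

class WorkerProxy:
    """Stands in for a started worker; records the work passed to it."""

    def __init__(self, worker_key):
        self.worker_key = worker_key
        self.received = []

    def run_task(self, task_key, args):
        self.received.append((task_key, args))
        return "%s accepted %s" % (self.worker_key, task_key)


class Node:
    """Tracks running workers and queues work for workers still starting."""

    def __init__(self):
        self.workers = {}   # worker_key -> WorkerProxy (running workers only)
        self.pending = {}   # worker_key -> calls waiting for the worker
        self.spawned = []   # worker keys for which a start was requested

    def task_run(self, worker_key, task_key, args, callback):
        worker = self.workers.get(worker_key)
        if worker is not None:
            # Worker is already up: pass the work on immediately.
            callback(worker.run_task(task_key, args))
            return
        if worker_key not in self.pending:
            # First request for this worker: start it exactly once.
            self.pending[worker_key] = []
            self.spawn_worker(worker_key)
        # Remember the call until the worker reports in.
        self.pending[worker_key].append((task_key, args, callback))

    def spawn_worker(self, worker_key):
        # A real node would start a sandboxed process here (assumption).
        self.spawned.append(worker_key)

    def worker_connected(self, worker_key):
        # Called when the new worker connects back to the node.
        worker = WorkerProxy(worker_key)
        self.workers[worker_key] = worker
        for task_key, args, callback in self.pending.pop(worker_key, []):
            callback(worker.run_task(task_key, args))
        return worker

    def worker_stopped(self, worker_key):
        # Forget a worker once its TaskInstance is finished.
        self.workers.pop(worker_key, None)
```

The point of the queue is that the caller of task_run does not need to know whether the worker already exists; the callback fires either immediately or once the worker has connected.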
>> == status ==
>> Much of the above code is in place but it is not tested. I'll likely have
>> it complete within the next few days.