[Pydra] Master-Node-Worker relationship refactor

Peter Krenesky peter at osuosl.org
Wed Aug 19 16:12:23 UTC 2009


Yin QIU wrote:
> On Wed, Aug 19, 2009 at 5:31 AM, Peter Krenesky<peter at osuosl.org> wrote:
>   
>> It does not appear to load existing tasks properly.  This happened when
>> i ran the task a second time:
>>
>>  File "/usr/lib/python2.6/dist-packages/twisted/spread/flavors.py",
>> line 114, in remoteMessageReceived
>>    state = method(*args, **kw)
>>  File
>> "/home/peter/wrk/pydra.sync2/pydra/pydra_server/cluster/worker/worker_task_controls.py",
>> line 73, in run_task
>>    workunit_key, main_worker, task_id)
>>  File
>> "/home/peter/wrk/pydra.sync2/pydra/pydra_server/cluster/tasks/task_manager.py",
>> line 304, in retrieve_task
>>    module_search_path = [pydraSettings.tasks_dir, pkg.folder \
>> exceptions.AttributeError: TaskPackage instance has no attribute 'folder'
>>
>>     
>
> Sorry that I missed testing that branch of the control flow. I've
> fixed this issue.
>
>   
Great!

>> Otherwise it works really well, great job!
>>
>>  I didn't test it with larger files, but I didn't really notice any
>> extra load time while syncing.  I like how you split out the package
>> loading as well; it's a great way of handling it.
>>
>>     
>
> Good to hear this. Thanks!
>
>   
>> It handles updated files, but can it handle multiple versions of the
>> same task, so that a task can be updated while it is still running?  I
>> know that one was a bit more complicated and likely requires saving
>> files in a different directory structure.
>>     
>
> Technically this wouldn't cause much trouble. Physical locations of
> task packages are now computed simply by concatenating the tasks_dir
> and the package name. We read and write task packages using the
> package name as a key. We can include versions in the lookup too,
> which will allow multiple versions of a task package to exist
> simultaneously. By examining the last-modified time of folders, we'll
> be able to tell which folder contains the latest code.
>
> My primary concern, however, is the complicated package deployment
> logic. Imagining we already have the multi-version-package feature, I
> can think of two major problems (or complications).
>
> 1. The user deploys a task package called "foo" on the master. Now the
> user shouldn't be allowed to directly put a "foo" folder in
> task_cache. Since we may be tracking multiple versions of "foo", the
> contents of "foo" is likely to be saved in "task_cache/foo/v3", where
> "v3", in our implementation, is the calculated sha1 hash of the
> package contents. From another perspective, we should prevent users
> from manipulating task_cache manually, for that would cause
> inconsistency.
>
>   
You're right about this.  A user manually editing it would cause problems.

What if we store a processed copy of the task?  We provide a TASK_CACHE
directory which users can deploy tasks to.  When the task manager reads
the tasks from TASK_CACHE, they are copied to TASK_CACHE_INTERNAL if it's
a new/updated version.  The tasks that actually get loaded/executed would
be in TASK_CACHE_INTERNAL.

It duplicates stored code but it allows users to change a file without
affecting the code pydra uses.
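
Here's a rough sketch of what I'm picturing, just to make it concrete.
The directory names and helper functions below are placeholders, not the
existing task manager API:

# sketch: copy user-deployed packages from TASK_CACHE into a hashed,
# internal directory that only pydra reads from
import hashlib
import os
import shutil

TASK_CACHE = 'task_cache'                    # user-editable deployment dir
TASK_CACHE_INTERNAL = 'task_cache_internal'  # pydra-managed copies

def package_hash(package_dir):
    """Compute a sha1 over the package contents so each version gets a stable id."""
    sha = hashlib.sha1()
    for root, dirs, files in os.walk(package_dir):
        for name in sorted(files):
            path = os.path.join(root, name)
            sha.update(os.path.relpath(path, package_dir).encode('utf-8'))
            with open(path, 'rb') as f:
                sha.update(f.read())
    return sha.hexdigest()

def import_package(name):
    """Copy TASK_CACHE/<name> to TASK_CACHE_INTERNAL/<name>/<sha1> if it's new."""
    src = os.path.join(TASK_CACHE, name)
    version = package_hash(src)
    dest = os.path.join(TASK_CACHE_INTERNAL, name, version)
    if not os.path.exists(dest):
        shutil.copytree(src, dest)
    return dest  # the directory the task manager would actually load from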


> 2. As the number of versions of a task package increases, we'll need to
> clean up older contents. This can be done manually or automatically.
> Obviously, automatic cleanup is more appealing. But to do this, we have
> to track the status of each version of a task package (probably in the
> scheduler). If a version is not used by any worker, we can safely delete it.
>
>   
Definitely an automatic cleanup.  The only reason for multiple versions
is to allow updates while tasks are running.  We could modify the
scheduler to emit TASK_START and TASK_STOP signals so that the task
manager could track which packages are in use.
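
Roughly, the task manager could keep a refcount per package version,
something like the sketch below.  The class and method names are made
up; only the TASK_START/TASK_STOP signals are what I'm actually proposing:

# sketch: reference-counted cleanup driven by TASK_START / TASK_STOP
# signals.  versions still in use by a running task are never deleted.
import os
import shutil
from collections import defaultdict

class PackageUsageTracker(object):
    """Placeholder for logic that would live in the task manager."""

    def __init__(self, internal_cache):
        self.internal_cache = internal_cache   # e.g. task_cache_internal
        self.refcounts = defaultdict(int)      # (package, version) -> running tasks

    def task_started(self, package, version):
        """Handler for TASK_START: mark this version as in use."""
        self.refcounts[(package, version)] += 1

    def task_stopped(self, package, version):
        """Handler for TASK_STOP: release the version and clean up old copies."""
        self.refcounts[(package, version)] -= 1
        if self.refcounts[(package, version)] == 0:
            del self.refcounts[(package, version)]
            self.cleanup(package, keep=self.latest_version(package))

    def latest_version(self, package):
        """Most recently modified version folder is assumed to hold the newest code."""
        pkg_dir = os.path.join(self.internal_cache, package)
        return max(os.listdir(pkg_dir),
                   key=lambda v: os.path.getmtime(os.path.join(pkg_dir, v)))

    def cleanup(self, package, keep):
        """Delete version folders that are neither the latest nor in use."""
        pkg_dir = os.path.join(self.internal_cache, package)
        for version in os.listdir(pkg_dir):
            if version != keep and (package, version) not in self.refcounts:
                shutil.rmtree(os.path.join(pkg_dir, version))
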
>> -Peter
>>
>>
>>
>>
>> Yin QIU wrote:
>>     
>>> Thanks Peter. That's great news.
>>>
>>> I have managed to make task synchronization work. In my experiment, I
>>> ran the master and a node in two different folders on the same
>>> machine. Initially, the node does not contain any task code in its
>>> task_cache. After the scheduler dispatches a task (I used the simplest
>>> TestTask; others will presumably work too) to the worker, the node
>>> automatically synchronizes the task code with the master. The same
>>> happens if the task code is updated on the master. As discussed, task
>>> synchronization is done asynchronously.
>>>
>>> Under the hood, two modules, namely TaskSyncClient and TaskSyncServer,
>>> interact with each other. The former is located on the node, and the
>>> latter resides on the master. Currently, TaskSyncClient is a module of
>>> the Worker ModuleManager. After your refactoring is done, it won't be
>>> hard to migrate it to the Node ModuleManager.
>>>
>>> Latest changes have been committed to the task_packaging branch on github.
>>>
>>> On Tue, Aug 18, 2009 at 12:44 PM, Peter Krenesky<peter at osuosl.org> wrote:
>>>
>>>       
>>>> Hi all,
>>>>
>>>> I've started refactoring Master, Node, and Worker to change the way in
>>>> which they relate to each other.  When this refactor is complete, Master
>>>> will only communicate with Nodes.  Node will be the only component to
>>>> interact with Workers.  Workers will be spawned per TaskInstance.
>>>>
>>>>
>>>> == WHY? ==
>>>>  - workers need to be chrooted (sandboxed) per TaskInstance to ensure
>>>> no task can affect other users.  Even importing a task file to read task
>>>> name and description puts the cluster at risk.
>>>>
>>>>  - Some libraries, django especially, can only be configured once per
>>>> runtime.  This means changing datasources is not possible under the
>>>> current system.
>>>>
>>>>  - less network overhead from TCP connections.
>>>>
>>>>  - simpler networking logic.
>>>>
>>>>
>>>>
>>>> == How? ==
>>>>
>>>> Master
>>>>     - remove WorkerConnectionManager module
>>>>     - change add_node() so that instead of adding workers to the
>>>> checker, WORKER_CONNECTED signals are emitted with a special proxy object
>>>> that mimics a WorkerAvatar but is really the remote from the Node.  This
>>>> allows all other logic in Master to remain the same.
>>>>    - change node disconnection logic to include disconnecting workers
>>>> as well
>>>>
>>>>
>>>> Node
>>>>     - Add WorkerConnectionManager Module; Master's version of this can
>>>> be reused.
>>>>     - Add mechanism for tracking running workers
>>>>     - Add task_run that manages passing work to workers, and starting
>>>> new workers.
>>>>     - Add a callback system to task_run to handle the asynchronous nature
>>>> of waiting for a worker to start before passing on a task_run
>>>>     - Add remotes that proxy all other functions in
>>>> worker_task_controls to worker avatars
>>>>     - Add remotes that proxy master functions to MasterAvatar.
>>>>
>>>>
>>>> Worker
>>>>    - Modify WorkerConnectionManager to connect locally only and use
>>>> Node key for auth.
>>>>
>>>>
>>>>
>>>> == status ==
>>>>
>>>> Much of the above code is in place, but it is not tested.  I'll likely
>>>> have it complete within the next few days.


