[Pydra] Master-Node-Worker relationship refactor

Peter Krenesky peter at osuosl.org
Thu Aug 20 14:15:45 UTC 2009


Yin QIU wrote:
> On Thu, Aug 20, 2009 at 12:12 AM, Peter Krenesky<peter at osuosl.org> wrote:
>   
>> Yin QIU wrote:
>>     
>>> On Wed, Aug 19, 2009 at 5:31 AM, Peter Krenesky<peter at osuosl.org> wrote:
>>>
>>>       
>>>> It does not appear to load existing tasks properly.  This happened when
>>>> I ran the task a second time:
>>>>
>>>>  File "/usr/lib/python2.6/dist-packages/twisted/spread/flavors.py",
>>>> line 114, in remoteMessageReceived
>>>>    state = method(*args, **kw)
>>>>  File
>>>> "/home/peter/wrk/pydra.sync2/pydra/pydra_server/cluster/worker/worker_task_controls.py",
>>>> line 73, in run_task
>>>>    workunit_key, main_worker, task_id)
>>>>  File
>>>> "/home/peter/wrk/pydra.sync2/pydra/pydra_server/cluster/tasks/task_manager.py",
>>>> line 304, in retrieve_task
>>>>    module_search_path = [pydraSettings.tasks_dir, pkg.folder \
>>>> exceptions.AttributeError: TaskPackage instance has no attribute 'folder'
>>>>
>>>>
>>>>         
>>> Sorry that I missed testing that branch of the control flow. I've
>>> fixed the issue.
>>>
>>>
>>>       
>> Great!
>>
>>     
>>>> Otherwise it works really well, great job!
>>>>
>>>>  I didn't test it with larger files, but I didn't really notice the
>>>> extra load time while syncing.  I like how you split out the package
>>>> loading as well; great way of handling it.
>>>>
>>>>
>>>>         
>>> Good to hear this. Thanks!
>>>
>>>
>>>       
>>>> It handles updated files, but can it handle multiple versions of the
>>>> same task, so that a task can be updated while it is running?  I know
>>>> that one was a bit more complicated and likely required saving files in
>>>> a different directory structure.
>>>>
>>>>         
>>> Technically this wouldn't cause much trouble. The physical location of
>>> a task package is now computed simply by concatenating tasks_dir
>>> and the package name. We read and write task packages using the
>>> package name as a key. We could include versions in the lookup too,
>>> which would allow multiple versions of a task package to exist
>>> simultaneously. By examining the last-modified time of the folders,
>>> we'd be able to tell which folder contains the latest code.
>>>
>>> My primary concern, however, is the complicated package deployment
>>> logic. Imagining we already had the multi-version-package
>>> feature, I can think of two major problems (or complications).
>>>
>>> 1. The user deploys a task package called "foo" on the master. Now the
>>> user shouldn't be allowed to directly put a "foo" folder in
>>> task_cache. Since we may be tracking multiple versions of "foo", its
>>> contents are likely to be saved in "task_cache/foo/v3", where
>>> "v3", in our implementation, is the calculated sha1 hash of the
>>> package contents. From another perspective, we should prevent users
>>> from manipulating task_cache manually, as that would cause
>>> inconsistency.
>>>
>>>
>>>       
>> You're right about this.  A user manually editing it would cause problems.
>>
>> What if we store a processed copy of the task?  We provide a TASK_CACHE
>> directory which users can deploy tasks to.  When the task manager reads
>> tasks from TASK_CACHE, they are copied to TASK_CACHE_INTERNAL if it's a
>> new/updated version.  The tasks that actually get loaded/executed would
>> be in TASK_CACHE_INTERNAL.
>>
>> It duplicates stored code but it allows users to change a file without
>> affecting the code pydra uses.
>>
>>     
>
> The first concern that comes up is efficiency, but that can be easily
> solved by using hard links.
>
> Besides, I guess we still have to provide some mechanism (e.g.,
> write permissions) to protect task_cache_internal, because it is
> essentially a file system resource just as task_cache is.
>
>   
Right, but so are the pydra source files.  It just needs to be in the
documentation that you should never, under any circumstances, touch the
internal cache.  We might even go so far as to intentionally store the
files in a way that makes no sense to the user (i.e., directories with
hash names).

> Another thing which I want to make clear is: how many different
> methods of deploying task packages are we going to support? Knowing
> this will help us consolidate the code that reads/writes tasks
> packages.
>
>   
Depending on how you look at it: one or two.

1) placing files in task_cache

2) through an API.  This allows for any method of deployment other than
accessing files directly, i.e., task uploads from the website,
drop folders, etc.  This should still deploy files to the TASK_CACHE for
consistency.  From there, the process of importing should be nearly
the same.

The only difference is that the API might include a confirmation that the
package was deployed, or in the case of updates, a confirmation that you
want to overwrite a task.


>>> 2. As the number of versions of a task package increases, we'll need
>>> to clean up older contents. This can be done manually or
>>> automatically. Obviously, automatic cleanup is more appealing. But to
>>> do this, we have to track the status of each version of a task package
>>> (probably in the scheduler). If a version is not used by any worker,
>>> we can safely delete it.
>>>
>>>
>>>       
>> Definitely automatic cleanup.  The only reason for multiple versions
>> is to allow updates while tasks are running.  We could modify the
>> scheduler to emit TASK_START and TASK_STOP signals so that the task
>> manager could track which packages are in use.
>>     
>
> Yes. Signals would be great.
>
>   
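To illustrate the signal-driven cleanup: the TASK_START/TASK_STOP names come
from the discussion above, but the reference-counting scheme below is just
an assumption about how the task manager might consume them:

```python
class PackageUsageTracker:
    """Reference-count (package, version) pairs from TASK_START and
    TASK_STOP signals; an old version is safe to clean up only when no
    running task still uses it."""

    def __init__(self):
        self.in_use = {}  # (pkg_name, version) -> running task count

    def task_started(self, pkg_name, version):
        key = (pkg_name, version)
        self.in_use[key] = self.in_use.get(key, 0) + 1

    def task_stopped(self, pkg_name, version):
        key = (pkg_name, version)
        self.in_use[key] -= 1
        if self.in_use[key] == 0:
            del self.in_use[key]

    def removable_versions(self, pkg_name, all_versions, latest):
        """Versions other than the latest that no running task references."""
        return [v for v in all_versions
                if v != latest and (pkg_name, v) not in self.in_use]
```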
>>>> -Peter
>>>>
>>>>
>>>>
>>>>
>>>> Yin QIU wrote:
>>>>
>>>>         
>>>>> Thanks Peter. That's great news.
>>>>>
>>>>> I have managed to make task synchronization work. In my experiment, I
>>>>> ran the master and a node in two different folders on the same
>>>>> machine. Initially, the node did not contain any task code in its
>>>>> task_cache. After the scheduler dispatched a task (I used the simplest
>>>>> TestTask; others should presumably work too) to the worker, the node
>>>>> automatically synchronized the task code with the master. The same
>>>>> happens if the task code is updated on the master. As discussed, task
>>>>> synchronization is done asynchronously.
>>>>>
>>>>> Under the hood, two modules, namely TaskSyncClient and TaskSyncServer,
>>>>> interact with each other. The former is located on the node, and the
>>>>> latter resides on the master. Currently, TaskSyncClient is a module of
>>>>> the Worker ModuleManager. After your refactoring is done, it won't be
>>>>> hard to migrate it to the Node ModuleManager.
>>>>>
>>>>> Latest changes have been committed to the task_packaging branch on github.
>>>>>
>>>>> On Tue, Aug 18, 2009 at 12:44 PM, Peter Krenesky<peter at osuosl.org> wrote:
>>>>>
>>>>>
>>>>>           
>>>>>> Hi all,
>>>>>>
>>>>>> I've started refactoring Master, Node, and Worker to change the way
>>>>>> in which they relate to each other.  When this refactor is complete,
>>>>>> Master will only communicate with Nodes.  Node will be the only
>>>>>> component to interact with Workers.  Workers will be spawned per
>>>>>> TaskInstance.
>>>>>>
>>>>>>
>>>>>> == WHY? ==
>>>>>>  - workers need to be chrooted (sandboxed) per TaskInstance to ensure
>>>>>> no task can affect other users.  Even importing a task file to read
>>>>>> the task name and description puts the cluster at risk.
>>>>>>
>>>>>>  - Some libraries, django especially, can only be configured once per
>>>>>> runtime.  This means changing datasources is not possible under the
>>>>>> current system.
>>>>>>
>>>>>>  - less network overhead from TCP connections.
>>>>>>
>>>>>>  - simpler networking logic.
>>>>>>
>>>>>>
>>>>>>
>>>>>> == How? ==
>>>>>>
>>>>>> Master
>>>>>>     - remove WorkerConnectionManager module
>>>>>>     - change add_node() so that instead of adding workers to the
>>>>>> checker, WORKER_CONNECTED signals are emitted with a special proxy
>>>>>> object that mimics a WorkerAvatar but is really the remote from the
>>>>>> Node.  This allows all other logic in Master to remain the same.
>>>>>>     - change node disconnection logic to include disconnecting
>>>>>> workers as well
>>>>>>
>>>>>>
>>>>>> Node
>>>>>>     - Add WorkerConnectionManager Module, Master's version of this can
>>>>>> be reused.
>>>>>>     - Add mechanism for tracking running workers
>>>>>>     - Add task_run that manages passing work to workers, and starting
>>>>>> new workers.
>>>>>>     - Add callback system to task_run to handle asynchronous nature of
>>>>>> waiting for a worker to start before passing on a task_run
>>>>>>     - Add remotes that proxy all other functions in
>>>>>> worker_task_controls to worker avatars
>>>>>>     - Add remotes that proxy master functions to MasterAvatar.
>>>>>>
>>>>>>
>>>>>> Worker
>>>>>>    - Modify WorkerConnectionManager to connect locally only and use
>>>>>> Node key for auth.
>>>>>>
>>>>>>
>>>>>>
>>>>>> == status ==
>>>>>>
>>>>>> Much of the above code is in place, but it is not tested.  I'll
>>>>>> likely have it complete within the next few days.
>>>>>> _______________________________________________
>>>>>> Pydra mailing list
>>>>>> Pydra at osuosl.org
>>>>>> http://lists.osuosl.org/mailman/listinfo/pydra
>>>>>>
>>>>>>
>>>>>>
>>>>>>             
>>>>>
>>>>>           
>>>>
>>>>
>>>>         
>>>
>>>
>>>       
>>
>>     
>
>
>
>   


