[Pydra] Master-Node-Worker relationship refactor
allenchue at gmail.com
Wed Aug 19 09:50:13 UTC 2009
On Wed, Aug 19, 2009 at 5:31 AM, Peter Krenesky<peter at osuosl.org> wrote:
> It does not appear to load existing tasks properly. This happened when
> I ran the task a second time:
> File "/usr/lib/python2.6/dist-packages/twisted/spread/flavors.py",
> line 114, in remoteMessageReceived
> state = method(*args, **kw)
> line 73, in run_task
> workunit_key, main_worker, task_id)
> line 304, in retrieve_task
> module_search_path = [pydraSettings.tasks_dir, pkg.folder \
> exceptions.AttributeError: TaskPackage instance has no attribute 'folder'
Sorry that I missed testing that branch of the control flow. I've
fixed the issue.
> Otherwise it works really well, great job!
> I didn't test it with larger files, but I didn't really notice the
> extra load time while syncing. I like how you split out the package
> loading as well, great way of handling it.
Good to hear this. Thanks!
> It handles updated files, but can it handle multiple versions of the
> same task, e.g. a task being updated while it is still running? I know
> that one was a bit more complicated and likely required saving files in
> a different directory structure.
Technically this wouldn't cause much trouble. The physical location of
a task package is now computed simply by concatenating the tasks_dir
and the package name. We read and write task packages using the
package name as a key. We could include the version in the lookup too,
which would allow multiple versions of a task package to exist
simultaneously. By examining the last-modified time of the folders,
we'd be able to tell which one contains the latest code.
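For example (a rough sketch only; get_package_dir and its arguments are
placeholder names, not what's actually in the branch):

    import os

    def get_package_dir(tasks_dir, package_name, version=None):
        # Hypothetical lookup: tasks_dir + package name, plus an
        # optional version subfolder.
        base = os.path.join(tasks_dir, package_name)
        if version is not None:
            return os.path.join(base, version)
        # No version requested: fall back to the most recently
        # modified version folder, i.e. the one with the latest code.
        versions = [os.path.join(base, d) for d in os.listdir(base)
                    if os.path.isdir(os.path.join(base, d))]
        return max(versions, key=os.path.getmtime)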
My primary concern, however, is the complicated package deployment
logic. Assuming we already had the multi-version-package feature, I can
think of two major problems (or complications).
1. The user deploys a task package called "foo" on the master. Now the
user shouldn't be allowed to put a "foo" folder directly into
task_cache. Since we may be tracking multiple versions of "foo", the
contents of "foo" are likely to be saved in "task_cache/foo/v3", where
"v3", in our implementation, is the calculated sha1 hash of the
package contents. From another perspective, we should prevent users
from manipulating task_cache manually, because that would leave the
stored contents inconsistent with their computed hashes.
2. As the number of versions of a task package grows, we'll need to
clean up older versions. This can be done manually or automatically.
Obviously, automatic cleanup is more appealing, but to do this we have
to track the status of each version of a task package (probably in the
scheduler). If a version is not used by any worker, we can safely
delete it. A rough sketch of both points follows below.
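Roughly, I imagine something like this; compute_version,
cleanup_unused_versions and versions_in_use are only illustrative
names, and the scheduler would have to supply the set of versions
currently in use:

    import hashlib
    import os
    import shutil

    def compute_version(package_dir):
        # Hypothetical: hash the package's file contents; the hex
        # digest becomes the name of the version folder ("v3" above).
        sha = hashlib.sha1()
        for root, dirs, files in os.walk(package_dir):
            for name in sorted(files):
                with open(os.path.join(root, name), 'rb') as f:
                    sha.update(f.read())
        return sha.hexdigest()

    def cleanup_unused_versions(package_root, versions_in_use):
        # Hypothetical automatic cleanup: remove version folders that
        # no worker is currently using, according to the scheduler.
        for version in os.listdir(package_root):
            path = os.path.join(package_root, version)
            if os.path.isdir(path) and version not in versions_in_use:
                shutil.rmtree(path)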
What do you think?
> Yin QIU wrote:
>> Thanks Peter. That's great news.
>> I have managed to make task synchronization work. In my experiment, I
>> ran the master and a node in two different folders on the same
>> machine. Initially, the node does not contain any task code in its
>> task_cache. After the scheduler dispatches a task (I used the simplest
>> TestTask, others will presumably do too) to the worker, the node will
>> automatically synchronize the task code with the master. Similar
>> things will happen too if the task code is updated on the master. As
>> discussed, task synchronization is done asynchronously.
>> Under the hood, two modules, namely TaskSyncClient and TaskSyncServer,
>> interact with each other. The former is located on the node, and the
>> latter resides on the master. Currently, TaskSyncClient is a module of
>> the Worker ModuleManager. After your refactoring is done, it won't be
>> hard to migrate it to the Node ModuleManager.
>> Latest changes have been committed to the task_packaging branch on github.
>> On Tue, Aug 18, 2009 at 12:44 PM, Peter Krenesky<peter at osuosl.org> wrote:
>>> Hi all,
>>> I've started refactoring Master, Node, and Worker to change the way in
>>> which they relate to each other. When this refactor is complete, Master
>>> will only communicate with Nodes. Node will be the only component to
>>> interact with Workers. Workers will be spawned per TaskInstance.
>>> == WHY? ==
>>> - workers need to be chrooted (sandboxed) per TaskInstance to ensure
>>> no task can affect other users. Even importing a task file to read task
>>> name and description puts the cluster at risk.
>>> - Some libraries, django especially, can only be configured once per
>>> runtime. This means changing datasources is not possible under the
>>> current system.
>>> - less network overhead from TCP connections.
>>> - simpler networking logic.
>>> == How? ==
>>> - remove WorkerConnectionManager module
>>> - change add_node() so that instead of adding workers to the
>>> checker, WORKER_CONNECTED signals are emitted with a special proxy object
>>> that mimics a WorkerAvatar but is really the remote from the Node. This
>>> allows all other logic in Master to remain the same.
>>> - change node disconnection logic to include disconnecting workers
>>> as well
>>> - Add WorkerConnectionManager Module, Master's version of this can
>>> be reused.
>>> - Add mechanism for tracking running workers
>>> - Add task_run that manages passing work to workers, and starting
>>> new workers.
>>> - Add callback system to task_run to handle asynchronous nature of
>>> waiting for a worker to start before passing on a task_run
>>> - Add remotes that proxy all other functions in
>>> worker_task_controls to worker avatars
>>> - Add remotes that proxy master functions to MasterAvatar.
>>> - Modify WorkerConnectionManager to connect locally only and use
>>> Node key for auth.
>>> == status ==
>>> Much of the above code is in place, but it is not tested. I'll likely have
>>> it complete within the next few days.
Nanjing University, China