[Pydra] Return values of ParallelTask, etc.
allenchue at gmail.com
Sun May 31 13:21:23 UTC 2009
I'm sorry but I've got another issue. What if a request_worker call by
a ParallelTask fails? That is, what does a root worker do if it fails
to request a worker from the master?
The following is an excerpt from worker.py:
    def request_worker(self, subtask_key, args, workunit_key):
        """
        Requests a work unit be handled by another worker in the cluster
        """
        print '[info] Worker:%s - requesting worker for: %s' %
        deferred = self.master.callRemote('request_worker',
                                          subtask_key, args, workunit_key)
This call will invoke Master.request_worker through a WorkerAvatar.
But Master.request_worker returns no value. Since we know the master
will eventually call Master.select_worker to pick a worker to delegate
the work unit to, what if select_worker fails for lack of idle
workers, or if a subsequent call to run_worker fails? The current code
handles this by making Master.run_task() return a zero value. But
obviously Master.request_worker() does not inspect the return value of
run_task(). So does this mean that a worker has no way to detect
failure of Worker.request_worker()?
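
To make the gap concrete, here is a toy sketch of the call chain
described above, without Twisted. ToyMaster and all of its methods are
hypothetical stand-ins, not Pydra's real classes. The only behaviour
taken from the actual code is that run_task signals failure with a zero
value and that request_worker discards it. request_worker_checked is an
assumed alternative, shown only for contrast.

```python
class ToyMaster:
    """Hypothetical stand-in for the master; not Pydra's real class."""

    def __init__(self, idle_workers):
        self.idle_workers = list(idle_workers)

    def select_worker(self):
        # Return an idle worker name, or None when nobody is idle.
        return self.idle_workers.pop() if self.idle_workers else None

    def run_task(self, subtask_key, args, workunit_key):
        # As described above: a zero value signals failure.
        worker = self.select_worker()
        if worker is None:
            return 0
        return 1

    def request_worker(self, subtask_key, args, workunit_key):
        # The result of run_task is dropped, so the caller only sees None.
        self.run_task(subtask_key, args, workunit_key)

    def request_worker_checked(self, subtask_key, args, workunit_key):
        # Assumed alternative: hand the status back to the requester.
        return self.run_task(subtask_key, args, workunit_key)


master = ToyMaster(idle_workers=[])
silent = master.request_worker('subtask', (), 0)
checked = master.request_worker_checked('subtask', (), 0)
```

With no idle workers, the first call gives the requester nothing to act
on, while the second at least hands back the zero from run_task.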
If this is true, then even in the current setup we cannot guarantee that a
ParallelTask will complete with only 1 worker (in the extreme case). A
work unit might be assigned on the premise that it will be run on a
remote worker (and thus not assigned locally). But if it does not get
run, the whole task will never complete.
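
A rough simulation of that stall, under my own simplifying assumptions:
run_parallel_task and no_idle_workers are hypothetical names, and the
real ParallelTask logic is more involved. The return value of
remote_request only models what actually happened on the cluster. The
root worker has no fallback tied to it.

```python
def run_parallel_task(work_units, remote_request):
    """Toy model of a root worker driving a ParallelTask.

    remote_request(unit) is a hypothetical callable that tells the
    simulation whether a remote worker really ran the unit.
    """
    completed = set()
    pending = list(work_units)

    # The root worker runs the first unit itself.
    completed.add(pending.pop(0))

    # Every other unit is handed off on the premise that it runs remotely.
    for unit in pending:
        if remote_request(unit):
            completed.add(unit)
        # No else branch: a unit whose request failed is never run.

    return completed


def no_idle_workers(unit):
    # Single-worker cluster: nobody else can take the unit.
    return False


units = ['a', 'b', 'c']
done = run_parallel_task(units, no_idle_workers)
task_complete = (done == set(units))
```

Here only the locally run unit ever finishes, so the task as a whole
never reaches completion.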
I know that failing to request a worker is unlikely in the current
setup, because a task greedily acquires all available workers when it
starts. But this will no longer be true once we take faulty nodes and
competitive scheduling into account.
Actually, with a prospective scheduler, worker requests together with
the work units (including arguments and subtask_keys) would be queued
by the scheduler (on the master side?). In this sense, a request will
eventually be handled. But it seems that the problem that a
ParallelTask cannot complete with one worker would still exist,
because we have no way to know when we should run a work unit locally.
Any thoughts on this?
Nanjing University, China