[Pydra] Return values of ParallelTask, etc.

Yin QIU allenchue at gmail.com
Sun May 31 13:21:23 UTC 2009


I'm sorry, but I have another issue. What if a request_worker call
made by a ParallelTask fails? That is, what does a root worker do if
it fails to request a worker from the master?

The following is an excerpt from worker.py:


        def request_worker(self, subtask_key, args, workunit_key):
            """
            Requests a work unit be handled by another worker in the cluster
            """
            print '[info] Worker:%s - requesting worker for: %s' % (
                self.worker_key, subtask_key)
            deferred = self.master.callRemote(
                'request_worker', subtask_key, args, workunit_key)


This call will invoke Master.request_worker through a WorkerAvatar.
But Master.request_worker returns no value. We know the master will
eventually call Master.select_worker to pick a worker to delegate the
work unit to. What if select_worker fails for lack of idle workers, or
a subsequent call to run_worker fails? The current code handles this
by making Master.run_task() return zero. But Master.request_worker()
does not inspect the return value of run_task(). So does this mean
that a worker has no way to detect failure of
Worker.request_worker()?
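
To illustrate what I mean, here is a rough sketch of
Master.request_worker passing run_task()'s result back so that the
root worker can react. It is plain Python with simplified stand-in
classes, not the real Pydra or Twisted code. The nonzero success value
and the run_locally list are assumptions made up for illustration.

```python
# Simplified stand-ins for illustration only; not the real Pydra classes.
class Master(object):
    def __init__(self, idle_workers):
        self.idle_workers = list(idle_workers)

    def select_worker(self):
        # Returns an idle worker key, or None when every worker is busy.
        return self.idle_workers.pop() if self.idle_workers else None

    def run_task(self, subtask_key, args, workunit_key):
        # Zero means the work unit could not be started (as in the
        # current code); the nonzero success value is an assumption.
        worker = self.select_worker()
        if worker is None:
            return 0
        return 1

    def request_worker(self, subtask_key, args, workunit_key):
        # Proposed change: hand run_task()'s result back to the caller
        # instead of dropping it.
        return self.run_task(subtask_key, args, workunit_key)


class RootWorker(object):
    def __init__(self, master):
        self.master = master
        self.run_locally = []  # hypothetical list of work units to run here

    def request_worker(self, subtask_key, args, workunit_key):
        accepted = self.master.request_worker(subtask_key, args, workunit_key)
        if not accepted:
            # The request failed, so keep the work unit for local execution.
            self.run_locally.append(workunit_key)
        return bool(accepted)
```

With Twisted, I would expect the result to arrive through the deferred
returned by callRemote rather than as a direct return value, so the
check would live in a callback attached to that deferred.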

If this is true, then even in the current setup we cannot guarantee
that a ParallelTask will complete with only one worker (the extreme
case). A work unit might be handed off on the premise that it will run
on a remote worker, and so it is not run locally. But if it never gets
run remotely, the whole task will never complete.

I know that failing to request a worker is unlikely in the current
setup, because a task greedily acquires all available workers when it
starts. But this will no longer hold once we take faulty nodes and
competitive scheduling into account.

Actually, with a prospective scheduler, worker requests together with
their work units (including arguments and subtask_keys) would be
queued by the scheduler (on the master side?). In this sense, a
request will eventually be handled. But the problem that a
ParallelTask cannot complete with one worker would still seem to
exist, because we have no way to know when we should run a work unit
locally. Any thoughts on this?
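
To make the question concrete, one possible direction is a timeout on
the root worker: a work unit that has been requested but not reported
as started within some interval gets reclaimed and run locally. The
sketch below is only a guess at the bookkeeping. PendingRequests, its
method names, and the "started" notification from the master are all
hypothetical and do not exist in Pydra today.

```python
class PendingRequests(object):
    """Tracks work units handed to the master but not yet reported as started."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.deadlines = {}  # workunit_key -> time at which we stop waiting

    def requested(self, workunit_key, now):
        # Record when we stop waiting for a remote worker to pick this up.
        self.deadlines[workunit_key] = now + self.timeout

    def started(self, workunit_key):
        # Hypothetical notification: a remote worker has begun the work unit.
        self.deadlines.pop(workunit_key, None)

    def reclaim_expired(self, now):
        # Return the work units whose deadline has passed; these are the
        # candidates for local execution.
        expired = sorted(k for k, d in self.deadlines.items() if d <= now)
        for key in expired:
            del self.deadlines[key]
        return expired


pending = PendingRequests(timeout=30)
pending.requested(1, now=0)
pending.requested(2, now=10)
pending.started(2)                         # work unit 2 was picked up remotely
to_run_locally = pending.reclaim_expired(now=30)
```

The obvious weakness is that a queued request might still be handled
after the timeout, so the master would also need a way to cancel it,
or the same work unit could run twice.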


-- 
Yin Qiu
Nanjing University, China

