[Pydra] Updates on task packaging and task sync

Peter Krenesky peter at osuosl.org
Tue Sep 1 13:56:57 UTC 2009


Yin QIU wrote:
> On Sun, Aug 30, 2009 at 12:15 AM, Peter Krenesky<peter at osuosl.org> wrote:
>   
>> Yin QIU wrote:
>>     
>>> On Sat, Aug 29, 2009 at 2:38 PM, Peter Krenesky<peter at osuosl.org> wrote:
>>>
>>>       
>>>> The TaskSync code has been merged into master.  I haven't thoroughly
>>>> tested it yet, but the basic functionality works.  I only encountered
>>>> minor issues:
>>>>     * it was including .pyc files in the hashes so there were
>>>> mismatches when loading compiled code
>>>>
>>>>         
>>> I intentionally avoided examining only .py files in computing the
>>> hash, because I thought not all the files in a task package were .py
>>> files - there might be dynamic libraries, configuration files, etc. Of
>>> course, we can explicitly exclude .pyc files.
>>>
>>>
>>>       
>> and that's the right thing to do.  There shouldn't be any dynamic
>> libraries within the package anyway.  We can't avoid .pyc files because
>> that is where the Python runtime puts them.
>>
>> We could avoid this in a more generic way by just using the directory
>> name as the hash instead of computing it every time.  If users edit the
>> internal cache manually it's likely it will break anyway.  The only
>> thing we lose is the ability to determine if the user edited it locally
>> and tell them not to do it again.
>>
>>     
>
> If we hash directory names only, we won't be able to tell if a task
> package has code modifications if those modifications change neither
> the directory names nor the directory structure. So I think it would
> be better to keep computing hashes from the contents while excluding
> certain types of files, which can be configured flexibly, e.g., in
> settings.py.
>
>   
I wasn't suggesting hashing names only.  I was suggesting that we
shouldn't ever touch the INTERNAL_CACHE.  If you modify *anything* in
INTERNAL_CACHE the task_manager will treat the files in TASK_CACHE as
new files.  Once we implement automatic cleanup of old versions, it
would also remove your modified version.  This would not be a bug; that
is how it should work.
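
As a rough illustration of that cleanup (a minimal sketch, not Pydra's
actual code): it assumes each package has its own directory under
INTERNAL_CACHE with one subdirectory per version, named by hash.  The
function name and the "keep" argument are hypothetical.

```python
import os
import shutil


def cleanup_old_versions(package_dir, keep):
    """Delete every version directory under package_dir whose name is not in keep.

    package_dir: assumed layout INTERNAL_CACHE/<package>, holding one
                 subdirectory per version, named by that version's hash.
    keep:        set of hash strings still needed (latest plus any in use).
    Returns the sorted list of version names that were removed.
    """
    removed = []
    for name in sorted(os.listdir(package_dir)):
        path = os.path.join(package_dir, name)
        # Only the directory name is consulted; contents are never inspected,
        # so a hand-edited old version is removed like any other old version.
        if os.path.isdir(path) and name not in keep:
            shutil.rmtree(path)
            removed.append(name)
    return removed
```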

Since we can't manually do anything to INTERNAL_CACHE there is no reason
to require it to be hashed repeatedly.  It should be expected that if
you modify anything in INTERNAL_CACHE bad things will happen and it
*will* break.
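
Something along these lines is what I have in mind, as a sketch only.
The INTERNAL_CACHE/<package>/<hash> layout and both function names are
assumptions made for the example, not the current code.

```python
import os


def cached_versions(internal_cache, package):
    """Return the version hashes stored for a package, read from directory names only."""
    package_dir = os.path.join(internal_cache, package)
    if not os.path.isdir(package_dir):
        return []
    return sorted(
        name for name in os.listdir(package_dir)
        if os.path.isdir(os.path.join(package_dir, name))
    )


def has_version(internal_cache, package, version_hash):
    """True if INTERNAL_CACHE already holds this version; no file contents are read."""
    return os.path.isdir(os.path.join(internal_cache, package, version_hash))
```

The directory name is trusted as the hash, so nothing in INTERNAL_CACHE
is rehashed on lookup.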

TASK_CACHE of course still needs to be hashed every time to check for
changes.
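
For the TASK_CACHE side, a content hash that skips .pyc files could look
roughly like this.  It is a sketch under assumptions: hash_package and
EXCLUDED_EXTENSIONS are made-up names, and the exclusion list stands in
for whatever ends up configurable in settings.py.

```python
import hashlib
import os

# Hypothetical setting; the thread suggests making this configurable in settings.py.
EXCLUDED_EXTENSIONS = (".pyc",)


def hash_package(package_dir, excluded=EXCLUDED_EXTENSIONS):
    """Return a SHA1 hex digest over the relative paths and contents of a package."""
    excluded = tuple(excluded)
    sha = hashlib.sha1()
    for dirpath, dirnames, filenames in os.walk(package_dir):
        dirnames.sort()  # make the walk order deterministic
        for filename in sorted(filenames):
            if filename.endswith(excluded):
                continue  # skip compiled files and anything else excluded
            path = os.path.join(dirpath, filename)
            rel = os.path.relpath(path, package_dir).replace(os.sep, "/")
            sha.update(rel.encode("utf-8"))  # renames change the hash too
            with open(path, "rb") as handle:
                for chunk in iter(lambda: handle.read(65536), b""):
                    sha.update(chunk)
    return sha.hexdigest()
```

With this, compiling a module in place leaves the hash unchanged, while
editing any non-excluded file changes it.
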
>> I'm also considering whether we even need task_cache on Nodes, or just
>> task_cache_internal.  Really you shouldn't be manually placing files
>> there when there is no way to disable task synchronization.  We have the
>> issue of large packages that might make someone want to avoid the sync
>> client, but that should eventually be solved by using the
>> consumer/producer elements of Twisted to transfer the data more effectively.
>>
>>     
>
> Currently the TaskManagers at the master and the nodes are in fact
> identical. But now the situation is: we don't use the master's
> TaskManager to request synchronization and we don't use the nodes'
> TaskManagers to handle sync requests. That is, we actually made a
> distinction between the two kinds of TaskManagers. So task_cache on
> nodes seems to be unnecessary.
>
> We can of course implement a mechanism to disable task
> synchronization. On the other hand, we can leave this alone, and have
> the opportunity to implement a new feature that enables P2P-style
> synchronization. For example, in a large cluster, we may update the
> task package on an arbitrary node, and expect other nodes, perhaps
> including the master, to sync with this node. I think this would
> greatly ease the maintenance of a cluster. But certainly this feature
> is far from our current project goal, and is hence just an idea
> right now :-)
>
>   
Yeah, P2P-style distribution might be worthwhile.  I've been pondering
the idea of P2P-style communication in general since we have overhead
from Nodes always talking through the master.  P2P may end up being the
defining feature of 2.0.

I think that we can make synchronization more efficient.  We currently
synchronize the TASK_CACHE folder, which TaskManager reads and processes
tasks from.  That processing is repeated on every node that the code is
deployed to.

It would be faster to synchronize INTERNAL_CACHE.  The files there have
already been processed; you just need to deploy them.
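
A sketch of what deploying an already-processed version might look like,
leaving the transport out entirely.  pack_version and unpack_version are
hypothetical names, the INTERNAL_CACHE/<package>/<hash> layout is an
assumption, and the tar.gz format is just a convenient choice for the
example.

```python
import io
import os
import tarfile


def pack_version(version_dir):
    """Master side: pack one processed version directory into tar.gz bytes."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        # Store entries relative to the version directory itself.
        for name in sorted(os.listdir(version_dir)):
            archive.add(os.path.join(version_dir, name), arcname=name)
    return buffer.getvalue()


def unpack_version(data, internal_cache, package, version_hash):
    """Node side: unpack the bytes into INTERNAL_CACHE/<package>/<hash>."""
    target = os.path.join(internal_cache, package, version_hash)
    os.makedirs(target, exist_ok=True)
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:gz") as archive:
        # Assumes the archive comes from a trusted master; extracting
        # untrusted archives this way is unsafe.
        archive.extractall(target)
    return target
```

The node never reprocesses anything; it only places the files under the
hash the master already computed.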


These changes aren't a high priority right now.  We have something
that works; I'm just thinking about what the next step is for it.

>>>>     * run_task and _run_task signatures needed to be merged by hand to
>>>> match the changes I had made.
>>>>
>>>> - Peter
>>>>
>>>> Yin QIU wrote:
>>>>
>>>>         
>>>>> Hi,
>>>>>
>>>>> I just pushed some changes to my public repo. I managed to add
>>>>> preliminary support for keeping multiple versions of a task package.
>>>>>
>>>>> There are now two folders holding the task code, namely tasks_cache
>>>>> and tasks_cache_internal. The former is publicly known and is for
>>>>> deployment usage; the latter is used by TaskManager internally and is
>>>>> thus hidden from the outside world.
>>>>>
>>>>> tasks_cache always contains the latest code. We can either drop files
>>>>> to this folder or put contents into it through an API (not available
>>>>> yet). TaskManager keeps monitoring tasks_cache, and if it notices
>>>>> updates, copies the latest task code into tasks_cache_internal, where
>>>>> it places the code in a subdirectory with the SHA1 hash of the code as
>>>>> the directory's name.
>>>>>
>>>>> I've performed a simple test against this new feature. I dropped in a
>>>>> modified task package while running an older version of the package.
>>>>> This resulted in two different task packages in tasks_cache_internal.
>>>>>
>>>>> There is currently no cleanup mechanism. That is, once a task
>>>>> package is created in tasks_cache_internal, there is no automatic way
>>>>> to remove it after it expires. This issue will be resolved after we
>>>>> let the scheduler emit TASK_STARTED and TASK_STOPPED signals and
>>>>> handle these signals in TaskManager.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>           
>>>
>>>
>>>       
>> _______________________________________________
>> Pydra mailing list
>> Pydra at osuosl.org
>> http://lists.osuosl.org/mailman/listinfo/pydra
>>
>>     
>
>
>
>   


