[darcs-devel] announcing darcs 2.0.0pre1, the first prerelease for darcs 2

David Roundy droundy at darcs.net
Fri Jan 4 20:21:35 UTC 2008


On Fri, Dec 21, 2007 at 04:12:49AM +0300, Dmitry Kurochkin wrote:
> I have completed initial work on libwww pipelining. Output of darcs whatsnew
> is attached (sorry for that, I will try to make a proper patch tomorrow).
> What is done:
> - libcurl functionality is implemented using libwww. Now pipelining works.
> - New Libcurl module provides 3 functions:
>   * copyUrl - same as copyUrl from Curl.hs. It uses copyUrls and waitNextUrl.
>   * copyUrls - takes (filename, url) list, creates requests and adds
> them to libwww. Does not load anything.
>   * waitNextUrl - starts the libwww event loop and blocks until the first
> url loads (or an error happens). After it returns, it should be possible
> to add more urls to the queue using copyUrls again. waitNextUrl should be
> called as many times as there are urls in the queue.

Thanks for this contribution! I've finally gotten around to writing the
promised configure support for this, and it looks pretty nice, particularly
as a starting point for an internal API that we can use (and which can
hopefully also be supported through the curl multi API).

I've got a couple of suggestions/questions, now that I've had time to look
at the actual code.
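For reference, here's my reading of the interface the new module exposes;
the types below are my guesses from your description (ignoring any extra
arguments like cachability), not quotes of the code:

  -- fetch one url to a file, blocking until it's done:
  copyUrl     :: String -> FilePath -> IO ()
  -- queue up (filename, url) pairs without fetching anything yet:
  copyUrls    :: [(FilePath, String)] -> IO ()
  -- run the libwww event loop until the next queued url finishes (or fails):
  waitNextUrl :: IO ()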

How hard would it be to make a function

waitForURL :: String -> IO ()

which blocks until we've actually got the given URL?  This would allow us
to speculatively call copyUrls to grab stuff we expect to use later, without
having to keep track of the order in which things were queued (so as to call
waitNextUrl the proper number of times).  I think this would be a real
improvement.
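To make that concrete, the sort of usage I have in mind is something like
this (a sketch only; waitForURL doesn't exist yet, and the other names are
placeholders):

  prefetchThenUse :: [(FilePath, String)] -> String -> IO ()
  prefetchThenUse likelyNeeded urlNeededNow = do
    copyUrls likelyNeeded     -- queue everything we expect to want later
    waitForURL urlNeededNow   -- block only on the one we need right now
    -- ... carry on; the rest keeps downloading in the background, and we
    -- never had to count how many waitNextUrl calls were still owed.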

Related to this would be a feature that ignores duplicate calls to copyUrls
for a URL that is already queued.  This may not be supported by libwww
itself, but it'd be really handy, again for speculative triggering of
downloads.

Also related to this idea:  can we adjust the order of downloads in the
queue?  For example, maybe I'd like to add a file towards the front of the
queue because I need it right now.  This might be doable if waitForURL could
bump up the priority of that URL, provided it hasn't yet been requested from
the server.
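I'm not sure what libwww itself allows here, but the shape of API I'm
imagining is roughly the following (purely hypothetical, none of this is in
the patch):

  -- Hypothetical: a way to queue ahead of everything already waiting.
  data Priority = Normal | Urgent
  copyUrlsWith :: Priority -> [(FilePath, String)] -> IO ()

  -- Or, alternatively, waitForURL itself could move its URL to the head of
  -- the queue whenever libwww hasn't actually sent that request yet.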

I'm thinking of situations like this:

We're doing a darcs get.  This involves grabbing all the inventory files
and all the patch files from the server.  Each inventory file has pointers
to many patch files and the next inventory file.  We don't know how many
patches there are in the repository until we've downloaded (and read) all
the inventory files.

We could get the inventory files sequentially with no pipelining, count the
patch files, and then grab the patch files with pipelining while providing
nice feedback.  But this is a bit ugly:  we pay the full latency on every
inventory file, one at a time, even though we already know where a whole
bunch of patch files are that we could be grabbing.
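For concreteness, that two-phase version would look something like this
(readInventory, patchEntries and nextInventoryUrl are placeholders for
whatever we actually end up using):

  -- Phase 1: walk the inventories one at a time, paying full latency each.
  collectPatchUrls :: String -> IO [(FilePath, String)]
  collectPatchUrls invUrl = do
    copyUrl invUrl "inventory"
    inv <- readInventory "inventory"
    rest <- maybe (return []) collectPatchUrls (nextInventoryUrl inv)
    return (patchEntries inv ++ rest)

  -- Phase 2: we now know the total, so pipeline all the patches and report
  -- progress as each waitNextUrl returns.
  fetchPatches :: [(FilePath, String)] -> IO ()
  fetchPatches ps = do
    copyUrls ps
    mapM_ (const waitNextUrl) ps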

So a faster alternative would be, once we have the first inventory file, to
queue up the second inventory file and also all the patch files listed in
the first inventory.  Then when we get the second inventory, we queue up the
third inventory and all the patch files in the second inventory, etc.  This
is ugly (with the current API) because we won't get the last inventory until
we've already downloaded almost all the patch files.  It's very fast
(everything is pipelined), but because we've got a FIFO queue, the third
inventory can't be grabbed until we've already gotten all the patch files
from the first inventory, so we can't give nice feedback counting the number
of patch files we've got versus the total number.

Which is why it'd be nice to be able to prioritize the inventory files that
we're waiting on, so that we queue up the second inventory followed by all
the patches listed in the first inventory, but then when we get the second
inventory, we slip the third inventory in at the head of the queue.  So we
get all the inventories pretty quickly (although probably not as quickly as
if we took the first approach) and we're also interleaving the downloading
of patch files, keeping the pipe full (in theory, anyhow).
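In code, and assuming we had waitForURL plus some way for it to jump the
queue as suggested above, the loop might look roughly like this (again,
readInventory, patchEntries and nextInventoryUrl are placeholders):

  -- Sketch of the interleaved strategy.
  pipelinedGet :: String -> IO ()
  pipelinedGet url0 = go url0 []
    where
      go invUrl patches = do
        copyUrls [("inventory", invUrl)]
        waitForURL invUrl                -- bump the inventory to the front and block on it
        inv <- readInventory "inventory"
        copyUrls (patchEntries inv)      -- its patches go to the back of the queue
        let patches' = patches ++ patchEntries inv
        case nextInventoryUrl inv of
          Just next -> go next patches'  -- and we get a better running total for progress output
          Nothing   -> mapM_ (waitForURL . snd) patches'   -- finally drain all the patch files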

> At the moment the only place where copyUrls is used is the get command.
> But I hope this interface is enough for Darcs. If not, we need to think of
> something more complex. Waiting for comments here.

Hmmm.  My comments are above.  It's actually not a bad interface as is, but
waitNextUrl seems a bit awkward to use.  Actually, it has now occurred to me
that we could implement waitForURL as a wrapper around waitNextUrl, if we
kept tabs on what had been shoved into the queue.  It seems a bit ugly, but
we could live with that sort of solution, if libwww doesn't have this
functionality.
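Something along these lines, say (just a sketch, assuming responses complete
in the order they were queued, which should hold for pipelined requests on
one connection; copyUrls would need to append its URLs to this list, and
could also skip URLs already present, which would give us the duplicate
suppression mentioned above for free):

  import Data.IORef (IORef, newIORef, readIORef, writeIORef)
  import System.IO.Unsafe (unsafePerformIO)

  -- URLs queued but not yet completed, oldest first.  Not thread safe.
  {-# NOINLINE queuedUrls #-}
  queuedUrls :: IORef [String]
  queuedUrls = unsafePerformIO (newIORef [])

  waitForURL :: String -> IO ()
  waitForURL url = do
    queued <- readIORef queuedUrls
    case queued of
      []       -> return ()    -- nothing pending: already fetched, or never queued
      (u : us) -> do
        waitNextUrl            -- completes the oldest outstanding request
        writeIORef queuedUrls us
        if u == url then return () else waitForURL url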

> What is missing:
> - DARCS_PROXYUSERPWD is not used (but http_proxy works).
> - Proper error handling.
> - Not tested.
> - ???

These issues are somewhat less critical now that this can coexist with the
libcurl code.  Only interested users are likely to use the new code, so
it'll have a bit of time to mature.

I haven't yet done any performance testing myself.  That comes next (and
requires using my laptop, since the network of my work computer is too fast
for this to have a noticeable effect, as far as I can tell).

I expect I'll be applying this soon to the unstable repository.

David

