[darcs-users] GSoC: network optimisation vs cache vs library?

Alberto Bertogli albertito at blitiri.com.ar
Fri Apr 16 05:25:08 UTC 2010


On Wed, Apr 14, 2010 at 08:18:21PM -0400, Max Battcher wrote:
> On 4/14/2010 19:23, Zooko Wilcox-O'Hearn wrote:
>> Our project web site was just down for about an hour and a half a couple
>> of hours ago. The reason turned out to be that there were about a dozen
>> darcs processes running trying to answer queries like this:
>>
>> darcs query contents --quiet --match "hash
>> 20080103234853-92b7f-966e01e6a40dbe94209229f459988e9dea37013a.gz"
>> "docs/running.html"
>>
>> This is the query that the trac-darcs plugin issues when you hit this
>> web page:
>>
>> http://tahoe-lafs.org/trac/tahoe-lafs/changeset/1782/docs/running.html
>>
>> That particular query when run in isolation (i.e. not concurrently with
>> dozens of other queries) takes at least 20 seconds, and about 59 MB of RAM.
>>
>> Enough of these outstanding queries had piled up that the server ran out
>> of RAM and stopped serving our trac instance or allowing ssh access for
>> about an hour and a half.
>
> All of which goes to show that Trac+darcs still isn't well optimized for  
> caching darcs queries or dealing gracefully with long running  
> command invocations... I still say the Trac reliance on CVS/SVN-style  
> revision numbers means that Trac is absolutely not well-adapted for  
> serving darcs repositories. It may be "revision 1782" to Trac, but 'show  
> contents --match "hash 2008..."' is "commute this file to how it would  
> appear if only the patches preceding or equal to this one with a  
> timestamp from two years ago were applied" to darcs. (Which ends up  
> being quite possibly not a "real" historic version at all, and which  
> does quite a bit of work to be so easily susceptible to  
> crawlers/DDoS/accidental DDoS...)

I'm sorry, but darcsweb also has a similar issue, and I don't think either
darcsweb or trac is to blame.

The fact is that darcs is slow for some operations, and some of those matter
for more-or-less-daily usage, even if they do not fit well in darcs' model.


> 20secs doesn't sound unreasonable from the point of view that you are  
> asking darcs to create an entire new "version" of a file. While I expect  
> there is plenty of performance left to squeeze from this, I don't think  
> a query like this one will ever near git/svn/... historic revision  

git used to be really slow for annotates, because it does not suit its model
as well as the other operations. It just got better, to the point where it's
still a slow operation, but fast enough that people can use it just fine.

Darcs has done the same many times, with different operations.


> lookup, because this is an entirely different beast. It doesn't make  
> sense for me for Trac to rely on it for common queries.

They're frontends.

They do not attempt to implement the operations that make sense to darcs; they
implement the operations that are useful to the people using them.


> Maybe you should sponsor someone to work on "web scalability" for you.  
> For instance, a bit of AJAXy "long-running process" support ("Please  
> wait while this ahistoric version is fetched...") and a basic task queue  
> (RabbitMQ, Amazon SQS, whatever) to keep the server from biting off more  
> than it can chew at any given point... (Or even spreading about the  
> cache generation misery to more than one server. Queues are very useful  
> that way.)

darcsweb is a CGI file so it's easy to run and install. It's also really fast
when used with caching enabled. You can scale darcsweb quite easily: just run
as many instances as you want, with mirrored repositories. Probably far easier
to deploy at scale than most of the alternatives (because it is read-only),
although nobody has ever wanted to do such a thing, AFAIK.
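
To make the caching idea concrete, here is a minimal sketch in Python. It is
not darcsweb's or trac's actual code. The darcs command line is the one quoted
earlier in this thread; the function names, the cache layout and the cache_dir
parameter are hypothetical. It also assumes the answer for a given patch hash
and path does not change once computed, so a real deployment would need its
own invalidation policy.

```python
import hashlib
import os
import subprocess
import tempfile


def darcs_contents(repo_dir, patch_hash, path):
    """Run the slow query quoted earlier in this thread, inside repo_dir."""
    return subprocess.check_output(
        ["darcs", "query", "contents", "--quiet",
         "--match", "hash " + patch_hash, path],
        cwd=repo_dir,
    )


def cached_contents(cache_dir, repo_dir, patch_hash, path, run=darcs_contents):
    """Return the file contents, calling `run` only on a cache miss.

    Hypothetical helper: one cache file per (repo, patch hash, path).
    """
    raw_key = "\0".join([repo_dir, patch_hash, path]).encode("utf-8")
    entry = os.path.join(cache_dir, hashlib.sha1(raw_key).hexdigest())

    # Cache hit: serve the stored bytes without starting darcs at all.
    try:
        with open(entry, "rb") as f:
            return f.read()
    except FileNotFoundError:
        pass

    # Cache miss: pay the full cost once, then store the result.
    data = run(repo_dir, patch_hash, path)
    os.makedirs(cache_dir, exist_ok=True)

    # Write to a temporary file and rename it, so a concurrent reader
    # never sees a half-written cache entry.
    fd, tmp = tempfile.mkstemp(dir=cache_dir)
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    os.replace(tmp, entry)
    return data
```

With something like this in front of the query, only the first request for a
given changeset page pays the full cost; repeated hits (crawlers included)
are served from disk. It does nothing for that first request, which is still
as slow as darcs itself.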

While trac is a bit more heavyweight (it is a bigger project), I bet it
performs similar tricks.


As I understand it, he was presenting a real-life case where darcs is slow for
an operation users want to perform. The fact that it's being hit through a
web interface is not the issue; I don't see why a web interface should be any
different in this case from a GUI or command-line one.

It may be for very, very large sites (like a darcs equivalent of GitHub), but
not for people running darcsweb or trac (or whatever they may run) for
personal or organizational use.


> Forgive my petulance, but it seems fairly odd to me that for  
> someone working on a project for decentralized, scalable data storage  
> you seem fairly blind to web scalability issues when it comes to  
> Trac+Darcs...

As I mentioned above, I don't see how this is a web scalability issue. A GUI
frontend would show the same behaviour, because the behaviour is that of
darcs itself.


I understand if you don't think this is something darcs should prioritize, or
if you think it's not a big deal, or that it is expected because of how darcs
works internally. I think all of those are valid points (even if I don't
agree with some of them).

Zooko's post was, as I understand it, about the fact that his team _does_ want
this operation to work well, and I think that's interesting and valuable
information to have in mind when deciding what to optimize first.

Thanks,
		Alberto


