[darcs-users] Re: meta robots nofollow in darcs.cgi?

David Roundy droundy at abridgegame.org
Mon Feb 21 13:15:31 UTC 2005


On Sun, Feb 20, 2005 at 02:03:36PM -0700, Will wrote:
> Mark Stosberg <mark at summersault.com> writes:
> 
> > David Roundy wrote:
> >  > Hello users of darcs.cgi (and Will!),
> >  >
> >  > I just noticed that Google was browsing the darcs repository history,
> >  > chewing up CPU on the darcs.net server.  I fixed it (or hope it will
> >  > be fixed when Google notices) by adding a robots.txt file, but this
> >  > made me wonder whether we should add a <META NAME="ROBOTS"
> >  > CONTENT="NOFOLLOW"> tag to the output of darcs.cgi?
> >  >
> >  > It seems to me that *almost* never would you like a robot to be
> >  > indexing the contents of your repository, since the required calls
> >  > to annotate are slow, and there are vast numbers of links in any
> >  > reasonably-sized repository.  But it may be that some users would
> >  > prefer to have their repository histories indexed.
> >  >
> >  > Any thoughts?
> >
> > I think this should be a user option, perhaps with the default being
> > no indexing for safety.
> >
> > I have gotten useful answers from Google because it was able to search
> > published source code, so I think it should be an option.  Although if
> > 'annotate' is the only command used to display the source, and it's
> > always slow, maybe there is currently no good way to make the source
> > searchable with decent performance.
> 
> I like the idea of configurability but this seems like something that
> may need to differ depending on the page content.  It might be nice
> for the indexer to hit the repository and content listings but
> not annotations and patch listings.

Hmmmm.  That would make sense.  You might often want to allow indexing of
the current version of files, but you would almost never want to allow
indexing of *all* versions of files.  That would just be silly!
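
For reference, the robots.txt fix mentioned above is the blunt version of
this: something along the following lines (the exact path depends on where
the script is installed) keeps crawlers away from the whole script, rather
than just away from particular page types.

User-agent: *
Disallow: /cgi-bin/darcs.cgi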

The question is how we manage the tradeoff between configurability and
complexity.  Perhaps if we enumerate the different sorts of pages as
"browse", "annotate_file", "annotate_patch", "patches", "patches_specific"
(which would mean patches to a given file or directory), etc., we could
then have something like the following in cgi.conf:

robots_can_index = ALL
robots_can_follow = browse patches

This seems relatively straightforward to understand, I think.  The trick
would be enumerating the page types in a way that is clear to users.
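
To make the semantics concrete, here is a rough sketch in Python (purely
illustrative; none of these names come from darcs.cgi itself) of how the
script could turn those two settings into the robots meta tag for a given
page:

# Hypothetical sketch, not darcs.cgi's actual code; all the names
# here are made up for illustration.

PAGE_TYPES = set(["browse", "annotate_file", "annotate_patch",
                  "patches", "patches_specific"])

def parse_setting(value):
    # "ALL" stands for every page type; otherwise the value is a
    # space-separated list, as in the cgi.conf example above.
    if value.strip() == "ALL":
        return set(PAGE_TYPES)
    return set(value.split())

def robots_meta(page_type, can_index, can_follow):
    # Combine the two settings into one meta tag for this page.
    index = "INDEX" if page_type in can_index else "NOINDEX"
    follow = "FOLLOW" if page_type in can_follow else "NOFOLLOW"
    return '<META NAME="ROBOTS" CONTENT="%s,%s">' % (index, follow)

can_index = parse_setting("ALL")
can_follow = parse_setting("browse patches")
print(robots_meta("annotate_file", can_index, can_follow))
# prints: <META NAME="ROBOTS" CONTENT="INDEX,NOFOLLOW">

A page type appearing in neither list would get NOINDEX,NOFOLLOW, which
matches the safe default Mark suggested.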

> Perhaps the recent link rel="nofollow" attribute[1] would work?  I'm not
> sure whether those links are followed but not indexed, or just ignored
> entirely.

I agree that it doesn't seem clear that rel="nofollow" would prevent
indexing.  The Google link only makes it clear that it will keep Google
from using that link in its scoring system.
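
For contrast, the two mechanisms under discussion look like this:

<META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW">   (page-level)
<a rel="nofollow" href="...">...</a>              (per-link)

The first tells a robot what to do with the whole page; the second is a
hint attached to a single link, and exactly what robots do with that hint
is the unclear part.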
-- 
David Roundy
http://www.darcs.net
