[darcs-users] Handling relative directories

Jason Dagit dagit at codersbase.com
Tue Apr 6 16:03:57 UTC 2010

On Tue, Apr 6, 2010 at 6:26 AM, Petr Rockai <me at mornfall.net> wrote:

> Duncan Coutts <duncan.coutts at googlemail.com> writes:
> >> Doing a good job of the library won't be a quick job, but it sounds
> worth
> >> doing if we can find someone willing :-) Any volunteers?
> >
> > Another possibility which uses somewhat less FFI would be to use a
> > different representation of FilePath in darcs. I understand you already
> > do use some different type.
> >
> > One option would be to use a representation like a reverse-order list of
> > path components, with each component stored as a short packed string.
> > That allows for sharing between paths and would reduce the cost of using
> > long absolute paths.
> >
> > You'd still have a conversion via String to use the System.IO functions,
> > but at least that's only temporary garbage so affects just CPU time not
> > overall memory use.
> Interesting idea. I have already started using a path type that is a
> list of components (represented as bytestrings), it just did not occur
> to me to make it reverse to improve sharing. I'll try to look into doing
> that.

I could be wrong, but I was under the impression that many short bytestrings
leads to memory fragmentation in the GC.  I was also under the impression
that small bytestrings have worse overhead than small Strings.

It would be interesting to see any changes to the way we store/manipulate
filepaths benchmarked on pathological cases such as importing the entire
linux kernel tree.  If I recall correctly that test case has about 30k file
paths.  The last time I was looking at that case, I made diffing trivial and
discovered that our path manipulations can be quite expensive.

At that time, Simon Marlow suggested memory fragmentation as a possible
explanation for why the memory performance was bad.  I still don't
understand how to determine when ByteString will show up in a heap profile
and when only the block of pointers to the ByteStrings will show up
instead.  It has something to do with the allocation primitives used in
GHC.  Was it pinned vs. unpinned memory?  What bothers me about this
explanation is that I don't understand why our filepaths would be allocated
one way and most other bytestrings allocated the 'normal' way.  But, I do
recall that my heap profiles were lying about the space usage of my program
and what we concluded was that bytestrings were not correctly accounted for
in the heap profiles.  This work must have been about a year ago when there
was a Haskell hackathon in Portland at PSU.

Anyway, these are things to consider when you're polishing up the work and
trying to characterize the performance aspects.

> On the other hand, what we have now has very little overhead, since the
> relative paths are stored packed, 0-terminated inside the index and we
> call stat (by far the most time-critical file path use) with a pointer
> into the mmap'd index. This is, unfortunately, not going to work with
> absolute paths.
> We can resolve this internally on POSIX-y systems by calling fstatat
> instead of stat, presumably. On the other hand, this is POSIX.1-2008 and
> it does not exist on, say, Linux 2.4 or older (or even early Linux
> 2.6). Means we need compatibility wrappers... either that, or disable
> any threading support if these are not available.
> It *might* actually be a reasonable compromise to only provide threading
> on sensible systems (i.e. those that have openat/f*at...).

Yes, although given the way cabal works this might be tricky to detect at
compile time.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.osuosl.org/pipermail/darcs-users/attachments/20100406/f6f97452/attachment.htm>

More information about the darcs-users mailing list