[darcs-devel] Filesystem in DB, but data in filesystem

David Roundy droundy at darcs.net
Tue Jul 31 15:58:53 PDT 2007


On Tue, Jul 31, 2007 at 08:11:18AM +0200, Salvatore Insalaco wrote:
> 2007/7/30, Eric Y. Kow <eric.kow at gmail.com>:
> > > What about adding a "case-insensitive name" field (in addition to
> > > "canonical name", and "pristine name") to the pristine index?  So on
> > > Unix you might have "Makefile" and "makefile" as the canonical names
> > > (referred to pristine names like 1.dat or 233.dat) and then add in
> > > case-insensitive names "makefile" and "Makefile1".
> >
> > A similar kind of scheme might help with the Windows tricky filename
> > issue.
> 
> The biggest issue with case-insensitivity in pristine is that if there
> was a filename "conflict", any time in project history, even maybe in
> patch 1 and corrected in patch 2, a case-insensitive user cannot pull
> the repository (there's a bug on the bug reporter about this I think,
> on the GHC tree).
> Having case-insensitivity in the working dir is much less of a problem:
> we can just detect and complain, or do something "smart" such as renaming
> the file, and it concerns only the last "revision" of the source code.
> 
> So, let's summarize a bit. I'm going to write a more detailed proposal
> when we reach a good consensus, then post it there for "final review".

I've only very slightly skimmed over what's been posted so far (so these
might be redundant), but thought I'd add a couple of thoughts.

> By the way, tell me if I'm too "chatty" :). I'm accustomed to group
> development, so I like to share my thoughts a lot.

Chatty's good, but so is concise.  (e.g. this is a pretty concise email,
and so I read it more thoroughly, not having much time for general darcs
development at the moment.)

> - Everybody agrees that a "file index" solution, without using a
> relational db, is better.
> - We can use a plain text storage for file index, if there aren't
> excessive performance problems.
> - We would like to compute a checksum of the file (we are going to
> read it all anyway, and it will be in FS cache after the first read).
> - We like made-up filenames, so we can help the user to not
> accidentally modify them.
> - We like made-up filenames, so we can prevent the case-insensitivity
> and disallowed-filename problems in the pristine cache.
> - We like to be fast on directory and file move, so the made-up
> filenames should be independent of the path position (so we just
> update the index).
> - We can continue to use the hard-link trick for local copy of repositories.
> - The system has to check for the three corruption cases (file in FS
> but not in index, file in index but not in FS, file different in file
> index and FS), complain about the last two and offer an "optimize"
> option for the first one.
> - Instead of "sequential" filenames, we could use filenames similar to
> the ones in the patches directory. It helps with remote copy, as there
> are no filename conflicts.
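
The three corruption cases in the list above can be sketched as follows.
This is only an illustration in Python, not darcs code: the index layout
(canonical name mapped to a pristine name plus a SHA-1 checksum) and the
names sha1_of and check_pristine are assumptions made for the example.

```python
import hashlib
from pathlib import Path


def sha1_of(path):
    """Return the hex SHA-1 of a file's contents."""
    return hashlib.sha1(Path(path).read_bytes()).hexdigest()


def check_pristine(index, store):
    """Classify inconsistencies between a file index and the store.

    index maps canonical name -> (pristine name, expected sha1); this
    layout is hypothetical. Returns three sorted lists of pristine names:
    orphans    - in the filesystem but not in the index ("optimize" case)
    missing    - in the index but not in the filesystem (complain)
    mismatched - present in both, but the checksum differs (complain)
    """
    store = Path(store)
    on_disk = {p.name for p in store.iterdir() if p.is_file()}
    indexed = {pristine for pristine, _ in index.values()}
    orphans = sorted(on_disk - indexed)
    missing = sorted(indexed - on_disk)
    mismatched = sorted(
        pristine
        for pristine, digest in index.values()
        if pristine in on_disk and sha1_of(store / pristine) != digest
    )
    return orphans, missing, mismatched
```

Only the first list is safe to clean up automatically; the other two mean
that data has been lost or altered, so they should be reported.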

This sounds very promising.  The hashed inventory code has a whole
framework for dealing with files whose names are the hash of their
contents, which sounds like almost precisely what you want here (and is
very git-like, incidentally, suggesting possibilities of good performance
on Linux).
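
To make "files whose names are the hash of their contents" concrete, here
is a minimal Python sketch. It is not the darcs implementation: the
function names write_hashed and read_hashed and the choice of a bare hex
SHA-1 as the file name are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def write_hashed(store, data):
    """Store data under the hex SHA-1 of its contents; return that name."""
    store = Path(store)
    name = hashlib.sha1(data).hexdigest()
    target = store / name
    if not target.exists():
        # Identical contents always map to the same name, so an existing
        # file never needs to be rewritten.
        target.write_bytes(data)
    return name


def read_hashed(store, name):
    """Read a hashed file and verify that its contents match its name."""
    data = (Path(store) / name).read_bytes()
    if hashlib.sha1(data).hexdigest() != name:
        raise ValueError("contents do not match hash: " + name)
    return data
```

With this layout the name doubles as the checksum, and a file or directory
move only has to update the index, since the stored name does not depend
on the path.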

An advantage of reusing the hashed-inventory code is that this hashed
pristine cache could relatively easily benefit from the framework of
downloading options that has been implemented for the hashed inventory.
In particular, there's already support for looking for a hashed file in
either a central cache or another repository, which could assist in using
hard links to save disk space.  This is not a big deal, but reusing the
same framework (which isn't well documented) should make better use of
development effort (e.g. maybe once you figured out how to use it, you'd
then document this... which would document both uses).
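
The lookup described above (try a central cache or another repository,
and hard-link the result) might look roughly like the Python sketch below.
The function fetch_hashed and its parameters are hypothetical and are not
the Cache functions in darcs; the copy fallback is also an assumption, for
the case where the source lives on a different filesystem.

```python
import os
import shutil
from pathlib import Path


def fetch_hashed(name, dest, sources):
    """Make the hashed file `name` available in the directory `dest`.

    Each directory in `sources` (for example a central cache, then another
    repository) is tried in order. A hard link is preferred so the data is
    stored only once; if linking fails, the file is copied instead.
    """
    target = Path(dest) / name
    if target.exists():
        return target
    for source in sources:
        candidate = Path(source) / name
        if candidate.is_file():
            try:
                os.link(candidate, target)
            except OSError:
                # Hard links cannot cross filesystems; fall back to a copy.
                shutil.copyfile(candidate, target)
            return target
    raise FileNotFoundError(name)
```

Hard-linking is safe here because a hashed file is never modified in
place: new contents would get a new name.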

Also, I think we'd like to have the same solution for hashed filenames
(e.g. in one directory, or split into multiple directories) for both
purposes.  (See the next bullet point.)  I think git uses 256
subdirectories (named after the first two hex characters of the sha1
hash), which is probably a good idea.
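
For illustration, the git-style split into 256 subdirectories can be
sketched in Python as below. The function name fanout_path and the root
directory name used in the example are made up; only the scheme itself
(first two hex characters as the directory, the rest as the file name)
follows what git does for loose objects.

```python
from pathlib import PurePosixPath


def fanout_path(root, hex_digest):
    """Map a hex digest to root/<first two hex chars>/<remaining chars>.

    Two hex characters give 16 * 16 = 256 possible subdirectories, so the
    files are spread evenly without any extra bookkeeping.
    """
    if len(hex_digest) < 3:
        raise ValueError("digest too short to split")
    return PurePosixPath(root) / hex_digest[:2] / hex_digest[2:]
```

For a 40-character sha1 digest this yields a two-character directory name
and a 38-character file name.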

> - We could "partition" the files across multiple directories, with a
> hash algorithm, if there are too many (I think that most modern
> filesystems have no problems even with tens of thousands of files, but
> better check).
> 
> Anything else?

Err... see above.  :)

And see especially Darcs.Repository.Prefs for the Cache functions, and the
readHashFile and writeHashFile in Darcs.Repository.HashedRepo.  These
should do much of the busy work of creating file names for you, and nearby
is a demonstration of how to write something basically like the "file
index".
-- 
David Roundy
Department of Physics
Oregon State University

