[darcs-users] darcs-hs/hashed-storage review

Ganesh Sittampalam ganesh at earth.li
Wed Aug 5 23:37:22 UTC 2009

On Wed, 5 Aug 2009, Petr Rockai wrote:

> Ganesh Sittampalam <ganesh at earth.li> writes:

> > My most important question is about the dividing line between 
> > hashed-storage and darcs-hs. The biggest weakness of hashed-storage as 
> > a separate package is that it has functions that are specific to Darcs 
> > repositories. To my mind, it either needs to abandon this knowledge, 
> > perhaps by abstracting it over a typeclass that is meaningful in 
> > itself, or some or all of hashed-storage needs to move into the darcs 
> > tree proper. Otherwise, changes to some parts of darcs will continue 
> > to require lockstep changes to hashed-storage - which is ok when Petr 
> > is developing the two together intensively on his own, but not really 
> > when other people are involved or over a longer timeframe. One option 
> > would be to move it all into the darcs tree but distribute it as a 
> > separate cabal package with a more darcs-specific name.

> Well, I have done a bunch of work on separating the darcs-specific bits. 
> For now, they live in the library, because they are an important testing 
> resource. However, out of Storage.Hashed.Darcs, there are only a few 
> darcs-specific bits living in Storage.Hashed. It also seems that 
> abstracting the darcs-specific bits is workable.

> I however expect this to be useful outside of darcs proper as well. The 
> lock-step development is a problem, but I am not sure how much it can be 
> eliminated. Probably something for further discussion.

I guess the question is what the intended use cases for hashed-storage 
outside darcs are. I can see two possibilities:

(a) General hash-structured storage library. It feels like it should be
generally useful but the only concrete idea I can think of would be for 
reading git repos. In this situation hashed-storage really does belong as 
an independent library.

(b) A way of reading darcs repos (and perhaps writing them, for the
brave). This would be an important step towards a stable darcs API, and
clearly should be a separate cabal package, but if it's just that, I think
it should be called something darcs-specific (darcs-storage?) and live in
the darcs tree.

I'm not saying that either of these use cases have to exist prior to 
merging into mainline, just that it has to be understood (and documented) 
what the expectations are.

> > - In various places (e.g. deleting on windows, corrupt pending etc) a 
> > file gets renamed to a fixed different name, which could cause 
> > problems if it already exists. There should be some standard scheme to 
> > avoid collisions.

> The corrupt pending behaviour is lifted from pre-existing darcs code. 
> For deleting on windows, this indeed needs a better solution, but the 
> current one mostly works for what its worth: if the old file exists, it 
> is quite unlikely to be in use. I have asked elsewhere about disposing 
> of files on win32, the only problem with _darcs/trash is that 
> hashed-storage can't know about it.
> > - We need to sort out haskell_policy, the ratification of readFile all 
> > over the place is just silly. Replacing it with hlint as has been 
> > suggested sounds like a very good idea.

> Yes, I just didn't have the time to implement the hlint replacement yet. 
> If people feel this is important, I can try to get this done before end 
> of SoC.

No, I just mentioned it in passing. I think this is a better standalone 
task that would just distract from the immediately important stuff, unless 
it's very quick or easy.

> > - I think treeDiff is actually relatively unlikely to be wrong because 
> > it delegate s actual line by line diffing to Patch.Prim which hasn't 
> > changed. Storage.Hashed.Diff exists but seems unused, what's the 
> > status and plans there?

> The line/line diff is not a problem -- as you say, this is done through 
> hunk canonization. The problem is that unsafeDiff does unhealthy amount 
> of magic about empty files, tacking newlines here and there. I have 
> tried to simplify the logic while retaining the behaviour, but bugs 
> could have happened, even though unlikely, as I'd expect these to show 
> up on the testsuite. (I previously got it wrong and it has shown up... 
> but, you never know.)

I guess some QuickCheck is in order then, but it's not necessarily crucial 
that treeDiff actually does replicate unsafeDiff, so long as it does 
actually properly represent the diffs (otherwise users might be left with 
unrecorded changes which would be very weird/confusing).

> > - Presumably biggest risk is random bugs with invalid indexes. An 
> > explanation of when the index can become invalid would be helpful 
> > (there may be one in the code that I haven't spotted yet).
> I recall stating this somewhere, but it could be in a mail or IRC. I'll 
> add that to the inline documentation:
> -- | Mark the existing index as invalid. This has to be called whenever the
> -- listing of pristine changes and will cause darcs to update the index next
> -- time it tries to read it. (NB. This is about files added and removed from
> -- pristine: changes to file content in either pristine or working are handled
> -- transparently by the index reading code.)
> The consequences of a missed invalidate would be that the file is 
> considered deleted or added to the working copy, respectively, 
> regardless of its actual state. I suppose recording such state could 
> lead to inconsistent repository, even. We should indeed double-check 
> that the index is invalidated properly (this should always happen before 
> the pristine change happens, as extra invalidations are harmless).

Also, darcs check should check the index (and darcs repair fix it).

> > - Storage.Hashed.Index: code is full of magic numbers
> Yes, this module is a mess. :| I'll clean that up on HEAD, since I have 
> already done some improvements there (representing hashes in pure binary 
> form instead of ascii-hex strings). See also later about "finish".
> > - Petr and I have already discussed overloading Tree and TreeIO over 
> > the container (IO). This would make writing test rigs easier, and mean 
> > that code that processes Tree could be polymorphic where appropriate. 
> > Overloaded code would typically constrain the container to be in Monad 
> > or perhaps Applicative.
> This is actually done in HEAD. See 
> http://repos.mornfall.net/hashed-storage/dist/doc/html/hashed-storage/Storage-Hashed-Monad.html

Cool, it looks good.

> > - Win32 removeFile workaround: not used consistently, should be hidden 
> > behind some standard abstraction?
> Definitely, I just tacked this on in the beta cycle for 2.3 to make 
> tests pass on windows.
> > - I don't really understand what virtualTreeIO does and doesn't do.
> We can discuss this, or I can try to improve the haddocks.

I've read through it again and I now understand what's going on. The fact 
that IO actions (or whatever monad) are all over the place - in the Tree, 
in the sync operation, and as a lower layer of the monad - is rather 
confusing, though I can see why it's that way now I've got to grips with 
it. I think the main improvement to the haddocks that would have helped me 
is to change "get somehow reflected in the actual real filesystem" to 
"might get somehow reflected ...".

> > - I think TreeIO should be abstract, i.e. a newtype, unless there is 
> > some value in exposing the fact that it is a StateT.
> Yes, you are probably right. It was just me being lazy.
> > - hashedTreeIO never deletes things, so what removes dead stuff?

> Well, hashedTreeIO is currently not used. Well, it is used by darcs 
> check/repair in darcs-hs, which does its own cleanup. This is however 
> something that will need addressing.

OK, I'm now rather lost about what is needed for what. I'd have expected 
hashedTreeIO to be needed for everything that touches pristine in a hashed 
repo. Is this just a work in progress?

> > - I'm very dubious about semantics of finish - why doesn't expandPath 
> > call it? Presumably the documented usage semantics of readIndex have 
> > something to do with this, but it feels fragile, and I only sort of 
> > understand the whole dirs_changed/finish stuff in readIndex'
> You are right about being dubious. I however haven't found a better way 
> to reasonably implement index updates. In retrospect, it may have been 
> better to just provide a specialised "expand" for the index tree, that 
> would also do the update. I'll check if there's a reason preventing 
> that, and if no, I'll go for it in hashed-storage 0.4 and just get rid 
> of finish.

That would definitely be an improvement, but readIndex would still return
a Tree that wasn't quite like normal Trees, which makes me uncomfortable.
We should discuss a bit more what it's trying to achieve (perhaps on IRC).


More information about the darcs-users mailing list