[darcs-users] [patch351] Optimize darcs diff.

Eric Kow kowey at darcs.net
Thu Aug 19 13:37:07 UTC 2010


On Wed, Aug 18, 2010 at 21:00:47 +0000, Petr Ročkai wrote:
> I have changed the diff code to only write those files that actually changed in
> the temporary locations. On my other project (less than 1000 working copy
> files):

Hmm, so the idea here is that this optimisation would benefit us in the
case of repositories with lots of files in them (and one nice thing
about it is that users don't have to think in advance about limiting the
search in order to get benefits).

> (with cold cache)
> head: darcs diff  1,09s user 0,58s system 9% cpu 16,752 total
> now:  darcs diff  0,14s user 0,04s system 6% cpu 2,978 total
> 
> (with hot cache)
> head: darcs diff  0,36s user 0,18s system 98% cpu 0,548 total
> now:  darcs diff  0,06s user 0,01s system 86% cpu 0,078 total

It sounds like we need some sort of darcs diff test for darcs-benchmark.
I'm not sure how, maybe systematically diff against the last tag?

> PS: Presumably, it would be yet more efficient to not write anything out and
> not call an external diff tool in cases that can be avoided. I might release a
> cabal library for formatting diffs, so if we know how to handle the options and
> no explicit --diff-command is given, we could just do the diffing internally. I
> guess the win is not going to be as big as from this change, though.

Risk assessment trying to do our own diff formatting?  I'd be a bit
concerned about increasing our areas of responsibility (even in the
form of a standalone library)...

> PPS: The bundle depends on the path refactor one, since I became too frustrated
> with converting the paths half dozen times to satisfy the typechecker (I gave
> up facing a FilePath -> SubPath conversion, which would probably require
> another couple lines of path conversion code...)

OK, if it helps, here's a quick look at the diff patch itself assuming
the refactor goes in and we figure out what to do about old-fashioned.

Optimize darcs diff.
--------------------
COMMENT: What can you say to re-assure me that this filtering does what
         we expect with file and directory moves?

         For example, do the contents of paths argument 
         need to be adjusted for move patches before we take the
         intersection with touches?

> -                            else fixSubPaths opts args
> +  paths <- if null args then return []
> +                        else fixSubPaths opts args

Path_list gets renamed to the simpler, better paths.

> +  patchset <- readRepo repository

> +  unrecorded <- fromPrims `fmap` unrecordedChanges (UseIndex, ScanKnown) repository paths
> +  unrecorded' <- n2pia `fmap` anonymous unrecorded

Would a point-freer approach be safer here (to avoid accidentally using
unrecorded when we really mean unrecorded'?)

> +  Sealed all <- return $ case (secondMatch opts, patchset) of
> +    (True, _) -> seal patchset
> +    (False, (PatchSet untagged tagged)) -> seal $ PatchSet (unrecorded' :<: untagged) tagged

If we're using a "second" matcher (--to-p, --index=N-m), then we don't
need to care about the working directory at all (because at this point
in time, using any sort of second matcher implies stopping at least at
recorded).  Otherwise we have to slip in the patches from recorded to
working.

> +  Sealed ctx <- return $ if firstMatch opts
> +                            then matchFirstPatchset opts patchset
> +                            else seal patchset

If we're diffing against some point in the past (so --last=N, --from-p,
--index=n-M) we need to unpull the all the way back to the starting
point.

If we don't specify any sort of firstMatch, then what we're really
asking for is a diff between recorded and working so the context *is*
the patchset.

> +  Sealed match <- return $ if secondMatch opts then matchSecondPatchset opts patchset
> +                                               else seal all

Now about that second matcher.  If we'd asked for one then we basically
have to unpull any patches that come "after" the first/second matchers
dependency-wise.  Without a second matcher, we just pass everything
through, working directory and all.

COMMENT: I'm not sure if 'patchset' or 'all' would be the one to pass in
here (although I guess they're the same anyway, so it's just a code
reading question).

> +  (_ :>> todiff) <- return $ findCommonWithThem match ctx
> +  (_ :>> tounapply) <- return $ findCommonWithThem all match

Recall from issue1873 that @findCommonWithThem us them@ uses PatchInfo
from them and commutes the sequence @us@ into @(common :>> just_us)@.

So above, the sequence is

  ctx --(todiff)--> match --(tounapply)--> all

> +  base <- if secondMatch opts then readRecorded repository
> +                              else readUnrecorded repository []

COMMENT: This almost sounds like we're redundantly making the decision
we made in deciding what 'all' is.  I wonder if there's any way around
it, like having some sort of big tuple

 (all, base, etc) <- if secondMatch opts
                        then ...
                        else ...

Perhaps something like this would help us keep the code bug free?
(although your version is nice and airy, sort of easy to
digest in little bits and pieces).

> +  let touched = listTouchedFiles todiff
> +      files = case paths of [] -> touched
> +                            _ -> touched `intersect` paths
> +  relevant <- restrictSubpaths repository files

Style tweak: if null?

Or maybe intersectWithUnlessEmpty xs with a less yucky name.

> +  {- print files
> +  print $ mapFL info todiff
> +  print $ mapFL info tounapply -}

Looks like debug code that accidentally got left in.

> +  let filt = applyTreeFilter relevant . snd
> +      ppath = "_darcs/pristine.hashed"

> +  oldtree <- filt `fmap` hashedTreeIO
> +                (apply . invert $ unsafeCoercePEnd todiff +>+ tounapply) base ppath
> +  newtree <- filt `fmap` hashedTreeIO
> +                (apply . invert $ tounapply) base ppath

COMMENT: Do we need the unsafeCoercePEnd? I'm not qualified to tell (at
least not without a Think-Real-Hard phase)

But this looks right otherwise.  The two trees are what you get by
unapplying down to the context and the match respectively.

Random remark: (apply . invert) seems like another one of these cases
where a little bit of duplication is actually quite nice than trying to
refactor every single possible thing you reuse, much like hPutStr
stderr.  There's a slight nuance I don't have the words for yet...

>    withTempDir ("old-"++thename) $ \odir -> do
> hunk ./src/Darcs/Commands/Diff.lhs 241
> -    setCurrentDirectory formerdir
>      withTempDir ("new-"++thename) $ \ndir -> do

[snip old tree-building machinery from the simpler more innocent code]

> -    thediff <- withCurrentDirectory (toFilePath odir ++ "/..") $
> -                   case path_list of
> -                   [] -> rundiff (takeFileName $ toFilePath odir) (takeFileName $ toFilePath ndir)
> -                   fs -> vcat `fmap`
> -                         mapM (\f -> rundiff
> -                               (takeFileName (toFilePath odir) ++ "/" ++ toFilePath f)
> -                               (takeFileName (toFilePath ndir) ++ "/" ++ toFilePath f)) fs
> -    morepatches <- readRepo repository
> -    putDoc $ changelog (getDiffInfo opts morepatches)
> -            $$ thediff
> +      withCurrentDirectory formerdir $ do
> +        writePlainTree oldtree (toFilePath odir)
> +        writePlainTree newtree (toFilePath ndir)
> +        thediff <- withCurrentDirectory (toFilePath odir ++ "/..") $
> +                       rundiff (takeFileName $ toFilePath odir) (takeFileName $ toFilePath ndir)
> +        morepatches <- readRepo repository
> +        putDoc $ changelog (getDiffInfo opts morepatches)
> +               $$ thediff

I *think* this block is just indentation change as a consequence of the
rest of the code (plus writing out the trees we built).  We run diff on
the tree as a whole.

-- 
Eric Kow <http://www.nltg.brighton.ac.uk/home/Eric.Kow>
For a faster response, try +44 (0)1273 64 2905 or
xmpp:kowey at jabber.fr (Jabber or Google Talk only)
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 198 bytes
Desc: Digital signature
URL: <http://lists.osuosl.org/pipermail/darcs-users/attachments/20100819/0f28ad0e/attachment.pgp>


More information about the darcs-users mailing list