[darcs-devel] Optimizing "darcs diff"

Wed Mar 2 05:55:10 PST 2005

On Wed, Mar 02, 2005 at 12:35:17AM -0800, Kannan Goundan wrote:
> David:
> 
> I had sent a message to darcs-devel about minimizing the copying that
> goes on during "darcs diff".  I think I have something that works;
> the patch is attached (patch against darcs-unstable).
> 
> As I mentioned before, I'm a Haskell n00b so criticism is welcome. 
> I'm a little verbose with the variable names and inline comments; let
> me know if you would prefer more concise code.  I also used nouns for
> certain functions where they seemed to make sense to me.

Better too many comments and too-long variable names than the other way
around.

> The pruning code is a bit longer than had hoped.  The canonical
> filtering predicate is actually a set of 3 functions.  It's a little
> more complicated than absolutely necessary, but I wanted it to be
> generic enough to allow a more efficient implementation later. 
> There's still support for your preferred predicat "(FileName->Bool)",
> but it's coded as a special case of the canonical one.

Indeed, this is a bit complicated.  See below.

> When dealing with Slurpy objects, I noticed that it would have made
> the pruning function a little simpler if the FileName component was
> factored out:
> 
> data Slurpy = SlurpDir (Maybe (IO ())) [(FileName, Slurpy)]
>             | SlurpFile Bool (EpochTime,FileOffset) FileContents
> 
> Of course, the rest of the Darcs code might prefer the current
> structure...I don't know.  Just thought I'd mention it.

Hmmm.  I'm not sure, that would mean that a SlurpFile wouldn't know its own
name, which might complicate things a bit.

> Should I have sent this patch directly to darcs-devel?  Should I have
> included the patch text inline?

Best is to use darcs send (at least if you're not on windows), which will
then send the patch to darcs-devel.  This way others can take a look at
it and comment, which is good.  If you can't use darcs send for any reason,
then attaching it to an email sent to darcs-devel would be best.

> +                    let path_matches = filter_leaf context
> +                        filtered_contents = prune_list context contents
> +                    in if path_matches || filtered_contents /= []
> +                        then Just (SlurpDir name io filtered_contents)
> +                        else Nothing

I'd not bind "filter_leaf context" to "patch_matches" here.  I think it
just confuses things, since it makes for one more line that needs to be
read before the if statement can be understood.

> +type TreeFilter a = (
> +    a,                   -- path context
> +    a -> FileName -> a,  -- function to fold the path context
> +    a -> Bool,           -- match individual nodes
> +    a -> Maybe Bool)     -- match entire branches

First, I'd almost always prefer "data" declarations to type declarations
equivalent to tuples.  With a data declaration, you can give descriptive
names to all the components, and can also specify if some of the components
should be strict (which isn't really relevant in this case).  So I think
this would be clearer if it were something like

data TreeFilter a = TF {
                      path_context :: a,
                      add_to_path_context :: a -> FileName -> a
                      match_just_this_node :: a -> Bool
                      match_entire_branch :: a -> Maybe Bool
                    }

But I'm also not too keen on this structure, it seems overly complex.
First off, do we really need to support arbitrary data structures for
holding the current path context?  Eliminating the "a" from the type would
make things easier to understand.

As a more functional way of doing things, perhaps we could use something
like

type NewTreeFilter = FileName -> TreeMatch
data TreeMatch = RejectedTM | AcceptedTM | KeepLookingTM NewTreeFilter

The contents of KeepLookingTM would be a possibly modified TreeFilter,
which could encapsulate your path_context and add_to_path_context
information.

We could translate a TreeFilter a to NewTreeFilter by something like

tf2ntf :: TreeFilter a -> NewTreeFilter
tf2ntf tf fn = case (match_entire_branch tf) the_context of
               Just True -> AcceptedTM
               Just False -> RejectedTM
               Nothing -> KeepLookingTM $ tf { path_context = the_context }
               -- note that the above line uses the data "field update"
               -- syntax, which I'm not very familiar with, so I may have
               -- gotten it wrong.
    where the_context = (add_to_path_context tf) fn (path_context tf)

Here I've left out the "match individual nodes", since I don't understand
why it's needed.  I can see that we'd need it if we wanted to prune in such
a way that we kept a directory but none of its contents, but I can't see
why we'd want to do that.  Maybe I'm missing something here.

Not that you'd really want to perform such a translation, but hopefully
it'll clarify whether my NewTreeFilter interface is as complete as your
TreeFilter a interface, and how the fields in your interface would work
themselves out in my proposed interface.

I haven't decided whether to apply the patch as-is.  I'm definitely not
applying it today (I need to get to work).  Hopefully you'll fix it up
based on my recommendations, and then I won't have to decide...
-- 
David Roundy
http://www.darcs.net