[darcs-users] [patch318] tune the patch parser (and 10 more)

Jason Dagit bugs at darcs.net
Wed Aug 4 04:27:51 UTC 2010


Jason Dagit <aoeu> added the comment:

I'm replying inside the bug tracker, which doesn't have a "reply" feature.  For this reason, I'm inserting "> " 
manually on the lines.  My apologies if the formatting is weird because of that.

> > Sun Jul 25 09:56:10 BST 2010  Jason Dagit <dagit at codersbase.com>
> >   * tune the patch parser
> >   This patch is based on a few observations:
> >     * alterInput (const foo) completely replaces the parser's input
> >     * after peekInput, the above trick avoids redundant processing
> >     * best to leave a packed ByteString packed
> 
> 
> BC.unpack xs == "<" vs BC.singleton '<' (new)
> 
> is the latter really better?  I'd like to see the generated 
> code or get word from a ByteString expert.

Looking at the generated code shouldn't be too difficult.  Have you ever used ghc-core?  It's a wrapper around 
ghc specifically for displaying the generated code.

Here is an example command line that you can run from the top of the repo:

cd src && ghc-core -- -XScopedTypeVariables -XUndecidableInstances -O2 -I. ../dist/build/autogen/Version.hs 
Darcs/Patch/Read.hs | less -R

Where "Darcs/Patch/Read.hs" is whatever file you want to look at.  You will of course need to do a cabal build 
first to generate dist/build/autogen/Version.hs.  You might also want to look in the cabal file to see if there 
are flags you want to pass to GHC.  I found, through trial and error, which extensions are needed to get as far 
as Read.hs.  GHC will tell you which extensions to add if you get it wrong.  Once it's buffered into less you can 
search for "Darcs\.Patch\.Read\." and it should bring you to the first function in that module.  From there, 
reading the core is up to you :)  For me, it wasn't that helpful because I had a hard time mapping core to code 
in Read.hs.

Now, let me reason about why this is a good thing from a higher level than core.

xs is not a constant.  It's some token in the input stream.  The unpack is going to convert it to [Char] no 
matter what it is.  From then on we have to do traditional [Char] comparisons on it (here just (==)).  The 
alternative is to take the constant character '<' and turn it into a bytestring, where we can use the highly 
optimized bytestring functions to compare it with xs.  The next logical step is to float the packing of '<' to 
a high enough level that the bytestring representation is just a constant within the module and it need only be 
computed once per program run.  Compare that to unpacking xs for each comparison.  I don't know if GHC did the 
floating for me, but I'd be happy to send another patch with that change if it would make you feel better about 
it.
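
To make the comparison concrete, here is a minimal sketch (isLessThanOld, isLessThanNew and lessThanPS are 
made-up names for illustration; only the Data.ByteString.Char8 functions are real):

import qualified Data.ByteString.Char8 as BC

-- old style: unpack the token and fall back to [Char] comparison
isLessThanOld :: BC.ByteString -> Bool
isLessThanOld xs = BC.unpack xs == "<"

-- the packed constant, floated to the top level so it is built once per run
lessThanPS :: BC.ByteString
lessThanPS = BC.singleton '<'

-- new style: packed-against-packed comparison using the bytestring primitives
isLessThanNew :: BC.ByteString -> Bool
isLessThanNew xs = xs == lessThanPS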

Now, one more attempt at justifying it.  Don Stewart tells me that if you have a packed bytestring, you always 
want to leave it that way.  Minimizing unpacking is an automatic win in his book.

> You've removed initial calls to 'work myLex' in many places - isn't 
> this behaviour changing? OTOH the tests pass...

It is potentially a behavior-changing change.  The trick here is that with the existing parser, you can call 
the parsing functions without consuming input.  That's how the existing parser works.  It does a peekInput to 
get the remaining input.  Then it applies a parser function, such as myLex, to that string.  This returns a 
Maybe result.  When myLex is used this way, it doesn't consume the parser's input; you have to pass myLex to 
'work' to do that.  The peeked result can then be used in comparisons.  After it picks the next code path, it 
commits to the decision by doing 'work myLex'.

The way things are done now is also error-prone.  If the part that applies myLex is updated, then the 
corresponding 'work myLex' also must be updated to consume exactly the same thing.  It also means that we have 
to apply myLex to the input twice: once to get the next token and inspect it, and once later to throw it away.  
Which leads to the next comment.
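
Schematically, the old pattern looks like this (peekInput, myLex and work are the real ReadMonads names; 
readFoo, parseFoo and parseFallback are made-up stand-ins):

readFoo = do s <- peekInput              -- look at the remaining input, consuming nothing
             case myLex s of             -- lex the next token a first time
               Just (tok, _) | tok == BC.pack "foo" ->
                 work myLex >> parseFoo  -- commit: lex the very same token a second time
               _ -> parseFallback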
 
> the whole 'alterInput' thing feels messy. Since it's barely used
> after the whole chain of patches I won't worry further about that.

I suspect you're referring to the idiom where I use 'alterInput (const s)' where s is the remaining input after 
myLex foo?  It just avoids one extra lexing on the input.  If you spend enough time with the definitions I'm 
confident that you'll agree.  I'm not the only one using that trick in the source code.  See here:
$ ack "alterInput \(const"
Darcs/Patch/Patchy.hs
216:                              Just s' -> alterInput (const s') >> ifstr
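
Spelled out, that trick has the same shape as the sketch above, but with the re-lex replaced by alterInput 
(again, schematic only):

readFoo = do s <- peekInput
             case myLex s of
               Just (tok, rest) | tok == BC.pack "foo" ->
                 alterInput (const rest) >> parseFoo  -- drop the token we already lexed, no second lex
               _ -> parseFallback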

> > Sun Jul 25 13:06:51 BST 2010  Jason Dagit <dagit at codersbase.com>
> >   * add utility functions to ReadMonads
> >   The intention of these functions is that in the future we will be able
> >   to use more conventional notation to express parsers in the darcs source.
> 
> Generally fine.
> 
> Are the rather ugly separate definitions of bindSM 
> etc necessary for performance? If so it's a shame GHC can't 
> handle this better. My uninformed speculation would be that the 
> definitions (whether in the typeclass or not) would get inlined 
> as small anyway.

I'm really not sure.  I don't have any microbenchmarks available for the parser and the core is a bit much to 
read for me.  I can tell you this is exactly how attoparsec does it, and I know Bryan used criterion to 
benchmark it.

The reason I did it that way is alluded to in the comment in that section, reproduced here:
-- The following instances allow us to use more conventional
-- interfaces provided by other parser libraries. The instances are
-- defined using bindSM, returnSM, and failSM to avoid any infinite,
-- or even unnecessary, recursion of instances between ParserM and
-- Monad.  Other recursive uses will be fine, such as
-- (<|>) = mplus.
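
The shape of it, as a schematic sketch only (the real definitions in ReadMonads are more general; this toy 
version just shows the delegation pattern):

newtype SM a = SM { runSM :: String -> Maybe (a, String) }

bindSM :: SM a -> (a -> SM b) -> SM b
bindSM (SM p) f = SM $ \s -> case p s of
                               Nothing      -> Nothing
                               Just (x, s') -> runSM (f x) s'

returnSM :: a -> SM a
returnSM x = SM $ \s -> Just (x, s)

failSM :: SM a
failSM = SM $ const Nothing  -- backs the failure path (e.g. mzero)

-- every instance delegates to the helpers above, never to another instance,
-- so there is no recursion between the parser classes and Monad
instance Functor SM where
  fmap f m = m `bindSM` (returnSM . f)

instance Applicative SM where
  pure      = returnSM
  mf <*> mx = mf `bindSM` \f -> fmap f mx

instance Monad SM where
  (>>=) = bindSM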

When I was working through this code, someone in #haskell suggested that if my instances for things like 
Applicative used the Monad instance I might get into some trouble.  I admit, I didn't even try it the other way.  
I don't even know if the INLINEs are needed.  If it bothers you, I can try removing them on the grounds that 
they're a premature optimization.  On the other hand, we know the parser is a bottleneck at times, so we might 
as well optimize what we can when it's not going to be a maintenance nightmare (which I don't think the code 
you're referring to is; I think it's just not as pretty as it could be).

> I also have concerns about the existence of try - see below.
> 
> > Sun Jul 25 14:31:32 BST 2010  Jason Dagit <dagit at codersbase.com>
> >   * refactor Read and ReadMonads to remove peekInput from Read
> 
> > -          Nothing -> do s <- peekInput
> > -                        case myLex s of
> > -                          Just (ps', r) | ps' == BC.pack "merger" ->
> > -                                          alterInput (const r) >>
> > -                                          (liftM (Just . seal) $ readMerger True)
> > -                                        | ps' == BC.pack "regrem" ->
> > -                                          alterInput (const r) >>
> > -                                          (liftM (Just . seal) $ readMerger False)
> > -                          _ -> liftM (fmap (mapSeal PP)) $ readPatch' want_eof
> > +          Nothing -> choice [ liftM (Just . seal) $ try $ readMerger True
> > +                            , liftM (Just . seal) $ try $ readMerger False
> > +                            , liftM (fmap (mapSeal PP)) $ try $ readPatch' want_eof
> > +                            , return Nothing ]
> 
> The final return Nothing and the try in the previous case seem 
> to be new here; it's correct since try foo <|> return Nothing = 
> foo, but seems gratuitous.

If I recall correctly, I added that "return Nothing" while getting some tests to pass.  It's not the only such 
"return Nothing" so I could be mistaken.  I do recall adding one to fix a bug though.  This particular parser is 
written so that it always returns something, even if that something is Nothing.  That's somewhat of a weird 
thing to do with parsec-style parsers, but with the existing darcs parser it did make some sense.  Ideally, we 
would refactor this parser so that it either fails or returns a non-Maybe value.  I haven't done that yet, but 
there is no reason not to other than trying not to do too much all at once.

Also, I think you're mistaken about "try foo <|> return Nothing = foo".  If the foo parser succeeds, the whole 
expression does equal foo.  When foo fails, it produces no value, so the try rewinds the input and "return 
Nothing" supplies a result to the caller.  We have to differentiate between failing in the parser monad (no 
value is produced and that path of the computation is aborted) and returning Nothing as the result of a 
successful parse.  It's the difference between "Parser a" and "Parser (Maybe a)".  The parser above is in the 
latter category.
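
To make the distinction concrete, here is a tiny example written against attoparsec as it looks today 
(maybeMerger is a made-up stand-in, not darcs code):

import Control.Applicative ((<|>))
import Data.Attoparsec.ByteString.Char8 (Parser, parseOnly, string, try)
import qualified Data.ByteString.Char8 as BC

-- a parser in the "Parser (Maybe a)" style: it always succeeds, but the
-- result may be Nothing
maybeMerger :: Parser (Maybe BC.ByteString)
maybeMerger = try (Just <$> string (BC.pack "merger")) <|> return Nothing

-- parseOnly maybeMerger (BC.pack "merger ...") ==> Right (Just "merger")
-- parseOnly maybeMerger (BC.pack "hunk ...")   ==> Right Nothing
-- The second parse succeeds with Nothing; a bare string "merger" would
-- instead fail on that input, which is the difference described above.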

> I like the fact that each parser piece is now self-contained 
> (knows about its own intro text). However doesn't it lead to 
> myLex being called repeatedly where previously it was only 
> being called once or sometimes twice? We should be sure this is 
> benchmarked adequately and not a problem in practice before we 
> go down that route.

I have tried to provide the output of darcs-benchmark.  Please let me know if that is not sufficient evidence.  
My understanding is that darcs-benchmark is intended to cover the use cases that Real Users(tm) care about.

As for your concern, yes, we probably could get better performance by allowing a 1-token peek, meaning you 
extract the next token and return it without consuming it.  Attoparsec (my current target parser library) does 
not support this, so I chose not to expose that from ReadMonads.hs.

So the current parser does this inside of readPrim:
  1. It grabs all the remaining input, called s
  2. lexes s, and unpacks the result, say s'
  3. compares s' against each patch type to find the matching parser
  4. when it finds the matching parser p, it throws away s' and s (s is just a reference to remaining input, 
which does stay in memory)
  5. p is invoked on the remaining input, which is still equal to s
  6. p applies myLex to remaining input

The new parser does this inside of readPrim:
  1. gets the remaining input, called s (by def. of try function)
  2. applies some parser p, which lexes on s and compares with what it requires
  3. if lexing on s fails then we throw away our reference to s and go to 1
  4. if p found the input it requires, no redundant lexing of input is required as p is the right parser for 
this input (redundant lexing is avoided by throwing away the lexed part with alterInput)

We remove the redundant lexing in part 3, and replace it with a) redundantly fetching a reference to s (this 
should be fast), and b) lexing the next token from s for each parser.  If you compare that with the steps above, 
you will see that my version does n lexes and comparisons where n is the number of patch types, or tokens to 
look for.  The existing parser does 2 lexes and n comparisons, but it also unpacks the token so it's doing 
[Char] comparison instead of fast bytestring comparison.  As mentioned before, unpacking is bad.  Also remember 
that n here is small because we only have a few patch types to consider.

To help mitigate these O(n) lexes, we could rearrange the order in which we compare the tokens, thus reducing 
the average case.  I think I mentioned this previously in this ticket.  Or, I could add token peeking, although 
when/if I switch to attoparsec I'm not sure how I would implement that.  Perhaps it can be added to attoparsec 
using the same trick I use at the end of this email to push the trys further down.

I also suspect that picking the correct patch parser is not typically where the bottleneck is.  
linesStartingWith will be a bottleneck any time there is a large hunk or binary patch.  I think you can see 
that linesStartingWith is probably our most performance-bound parser.

> Also, doesn't the use of try introduce a memory leak, since the 
> old string has to be held onto until we know whether the entire 
> sub-parser succeeded? Previously we would commit once we'd 
> checked the first token.

You might be right about it introducing memory leaks.  This is why you ideally use try as close to the leaves as 
possible (if you think of the paths of execution in your parser as a tree of alternatives).

The existing parser does a peek at the start.  That reference could span the entire body of readPrim, although 
in practice once you get to the "case ... of" line the reference to s can probably be marked as dead.  I don't 
know how smart the GC is about detecting an unused reference within a scope.

In the way I have written it, the trys are not scoped over the entire body of readPrim, so that helps push the 
trys down a bit.  In the failure case of those alternatives the potential leak can be collected immediately.  In 
the success case, it's a bit trickier.  I think in that case, you can't throw it away until you get to the end 
of the parser you passed to try.  I guess that's the leak you were thinking of.

In the case of a large hunk patch, that would mean you are holding the rest of the input in memory while 
building up two separate lists of ByteString representing the old and new lines in the hunk.  Then you would 
return from the hunk parser and as try returns the gc would finally be able to collect the part of the remaining 
input that corresponds to the large hunk (but not the two lists of bytestrings).  So, one way to work around 
this would be to rewrite readHunk like this and check for Nothing in readPrim:

readHunk :: ParserM m => FileNameFormat -> m (Maybe (Prim C(x y)))
readHunk fmt = do
  mk <- try (Just <$> lexString "hunk") <|> return Nothing
  case mk of
    Nothing -> return Nothing
    Just _ -> do
      fi <- myLex'
      l <- int
      have_nl <- skipNewline
      if have_nl
         then do linesStartingWith ' ' -- skipping context
                 old <- linesStartingWith '-'
                 new <- linesStartingWith '+'
                 linesStartingWith ' ' -- skipping context
                 return $ Just $ hunk (fn2fp $ readFileName fmt fi) l old new
         else return $ Just $ hunk (fn2fp $ readFileName fmt fi) l [] []

I could do the same thing to the other parsers, such as readBinary.  Maybe I could even find a prettier way to 
write the above or make a new combinator for it by making a variation of lexString.
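
For instance, the readHunk above could be factored through a small combinator along these lines (lexStringMaybe 
is a hypothetical name; it assumes the same try, <|> and lexString used above plus the Functor/Applicative 
machinery from the ReadMonads patch):

-- hypothetical: succeed with Nothing if the keyword is absent, otherwise
-- commit to the body and wrap its result in Just
lexStringMaybe :: ParserM m => String -> m a -> m (Maybe a)
lexStringMaybe kw body = do
  kwFound <- try (Just <$> lexString kw) <|> return Nothing
  case kwFound of
    Nothing -> return Nothing
    Just _  -> Just <$> body

readHunk would then start with 'lexStringMaybe "hunk" $ do ...', and the same treatment would apply to 
readBinary and friends.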

But anyway, that should fix any space leakiness that my changes could have possibly introduced.

So, if I implement the above space leak fix, do you plan to accept the patches?  I ask because if you are not 
planning to accept them I won't bother with the fix :)

Thanks for the review,
Jason

__________________________________
Darcs bug tracker <bugs at darcs.net>
<http://bugs.darcs.net/patch318>
__________________________________

