[darcs-users] darcs patch: optimize firstspace/firstnonspace to remove boxed retu...

Eric Kow kowey at darcs.net
Fri Oct 2 11:31:32 UTC 2009


On Sun, Sep 27, 2009 at 15:10:01 -0700, Jason Dagit wrote:
> While profiling away on darcs I discovered that we create a lot of
> intermediate garbage while parsing due to what is otherwise a very,
> very small intermediate allocation.

Thanks for this work.  It's nice to see Darcs get the kind of profiling
attention it deserves.

> ...
> Please let me know if you have any questions!

I have been ignoring this patch because it's over my head, probably the wrong
attitude on my part!  But since Jason requested attention, I'll wade in and
hope desperately not to muck things up.

You know I'm pretty risk-averse, so my questions are (obviously) going to be
how risky is this? 

 Risk 1 - What are the chances that this subtly changes the meaning of
          our code?  It doesn't seem to from my inspection...
 Risk 2 - What are the chances that this will cause general Bad Things
          to happen?

The fact that Don has looked at it and another HacPDXer gives me some
confidence, but Ian warning against it gives me pause.

Ian: do you think could help me understand some more?  Are your objections
about riskiness (as in 1, 2, or other), or more about long term
maintainability?

As I understand it, the discussion on IRC came to suggest that a lot of these
performance issues are just GHC problems and should be fixed there, rather than
in Darcs, and that we could perhaps hope for this to be magically fixed in a
future GHC.  If this were the case, what sort of time-frame are we looking at
here?  If this sort of thing could take over a year to fix, then it sounds like
we should think hard about including it as a workaround in the meantime.
That's two Darcs releases right there!

I think we should push this if (a) a real Haskell hacker gives us the go-ahead
(b) Ian tells me the risks are just long term things (c) we write a note to
ourselves reminding us to clean this up and re-write it a simpler way later and
why

Before before and after after
-----------------------------
To keep things in perspective, it's helpful to remember that this used to be
a C function:

#define ISSPACE(c) \
    ((c) == ' ' || (c) == '\t' || (c) == '\n' || (c) == '\r')

int first_white(const char *s, int len)
{
    const char *start;
    const char *end;

    for (start = s, end = s + len; s < end && !ISSPACE(*s); s++);

    return s - start;
}

which we imported like this:

foreign import ccall unsafe "fpstring.h first_white" first_white
    :: Ptr Word8 -> CInt -> IO CInt

So this may give us a sort of baseline to compare against.  If it's no
riskier than the old darcs 1.0.9 approach, why not?

For comparison, here's that new code again.

  firstspace :: Ptr Word8 -> Int -> Int -> Int
  firstspace (Ptr p) (I# n) (I# m) = I# (first p n m)
    where
    first :: Addr# -> Int# -> Int# -> Int#
    first addr n' m'
        | n' >=# m' = n'
        | otherwise = if isSpaceWord8 ch
                        then n'
                        else first addr (n' +# 1#) m'
      where
      ch = indexWord8OffAddr# addr n'

optimize firstspace/firstnonspace to remove boxed return value
--------------------------------------------------------------
> -                   hashed-storage >= 0.3.8 && < 0.4
> +                   hashed-storage >= 0.3.8 && < 0.4,
> +                   ghc-prim

Ian again: why is it objectionable to be using GHC primitives?  Is it because
of the likelihood of instability (things shifting under our feet?).  Or is it
just bad form to be doing things at this low a level?

> -{-# LANGUAGE BangPatterns, ForeignFunctionInterface, CPP, ScopedTypeVariables #-}
> +{-# LANGUAGE BangPatterns, ForeignFunctionInterface, CPP, ScopedTypeVariables, MagicHash #-}

The GHC man page just says

       -XMagicHash
              Allow "#" as a postfix modifier on identifiers.

> +import GHC.Exts ( Int#, Int(I#), Ptr(..), Word# )
> +import GHC.Prim ( indexWord8OffAddr#, (==#), (>=#), (+#), word2Int#, Addr# )

Which I think means that we get to say things like Word# and word2Int#

> -isSpaceWord8 :: Word8 -> Bool
> +isSpaceWord8 :: Word# -> Bool
>  isSpaceWord8 w =
> hunk ./src/ByteStringUtils.hs 139
> -    w == 0x20 ||    -- ' '
> -    w == 0x09 ||    -- '\t'
> -    w == 0x0A ||    -- '\n'
> -    w == 0x0D       -- '\r'
> +    i ==# 0x20# ||    -- ' '
> +    i ==# 0x09# ||    -- '\t'
> +    i ==# 0x0A# ||    -- '\n'
> +    i ==# 0x0D#       -- '\r'
> +  where
> +  i = word2Int# w

I guess I understand this, just use raw machine types to do this comparison?
What's the harm, really?

> hunk ./src/ByteStringUtils.hs 147
> -firstnonspace :: Ptr Word8 -> Int -> Int -> IO Int
> +firstnonspace :: Ptr Word8 -> Int -> Int -> Int

In my naive, purely high-level mental model of Haskell programming, when I see
'Ptr', I think "ooh, deep down murky machine things" which makes in turn think
"err, shouldn't that be in IO?" How does this work?  Am I to treat things like
'indexWord8OffAddr#' as having an implicit unsafePerformIO somewhere?

> -firstnonspace !ptr !n !m
> -    | n >= m    = return n
> -    | otherwise = do w <- peekElemOff ptr n
> -                     if isSpaceWord8 w then firstnonspace ptr (n+1) m else return n
> +firstnonspace (Ptr p) (I# n) (I# m) = I# (first p n m)
> +  where
> +  first :: Addr# -> Int# -> Int# -> Int#
> +  first addr n' m'
> +      | n' >=# m' = n'
> +      | otherwise = if (not (isSpaceWord8 ch))
> +                      then n'
> +                      else first addr (n' +# 1#) m'
> +    where
> +    ch = indexWord8OffAddr# addr n'
> +{-# INLINE firstnonspace #-}

Why the 'not'?  Using my number one metric of "can Eric understand this code?",
could we not just rewrite these ifs into

first addr n' m'
    | n' >=# m' = n'
    | isSpaceWord8 ch = first addr (n' +# 1#) m'
    | otherwise = n'

Or is that where your comment below applies?
> Note: I could have refactored this to be more compact but when I did
> that isSpaceWord8 started returning a boxed bool instead of inlining
> itself which then defeated the purpose.

> +firstspace :: Ptr Word8 -> Int -> Int -> Int
> +firstspace (Ptr p) (I# n) (I# m) = I# (first p n m)
> +  where
> +  first :: Addr# -> Int# -> Int# -> Int#
> +  first addr n' m'
> +      | n' >=# m' = n'
> +      | otherwise = if isSpaceWord8 ch
> +                      then n'
> +                      else first addr (n' +# 1#) m'
> +    where
> +    ch = indexWord8OffAddr# addr n'
> +{-# INLINE firstspace #-}

Gosh, I wish there was a nice high level way to write this with nice
performance something like first f...

-- 
Eric Kow <http://www.nltg.brighton.ac.uk/home/Eric.Kow>
PGP Key ID: 08AC04F9
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 197 bytes
Desc: Digital signature
URL: <http://lists.osuosl.org/pipermail/darcs-users/attachments/20091002/000d8f6c/attachment.pgp>


More information about the darcs-users mailing list