[darcs-users] [patch37] Store textual patch metadata encoded in UTF-8

Reinier Lamers tux_rocker at reinier.de
Fri Nov 13 07:48:08 UTC 2009


On Tuesday 10 November 2009 16:12, Eric Kow wrote:
> I think this is ready to go pending some questions (marked COMMENT
> ) below.

There have been some high-level criticisms and I think it's good to wait for
the discussion about those to settle before applying.
 
> Also, I know this is the second time I'm asking this (but this time with some
> variants which you may be happier with than my first request): I'd like you to
> consider a second time the possible merits of using diff+patch to submit a nice
> clean bundle without the rollbacks or conflicts and with some patches
> consolidated or divided up as you see fit.  To make this simpler for you,
> perhaps do this only against the patch bundle context and not the current repo
> state (let somebody else worry about the recent conflicts).

I believe I did that. It's there on the patch tracker. If those are not what you
mean, please elaborate.

> On Sun, Nov 01, 2009 at 17:55:50 +0000, Reinier Lamers wrote:
> > The patches make darcs treat old-style patches as locale-encoded.
> 
> COMMENT: So let me make sure I understand the implications of this.  Before this
> bundle, Darcs would treat patches as just a sequence of bytes.  From the
> user's perspective is that effectively equivalent to a Unicode-aware
> darcs treating them as locale-encoded?

Yes.

The matter is a bit more complicated, by the way. When darcs uses the patch
metadata as a patch identifier, it treats the metadata as a sequence of bytes,
and it will continue to do so even with this patch.

This patch only changes the behavior of darcs when interpreting the metadata as
text for display or for matching on it.
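
A sketch of the distinction (illustrative only, not the actual darcs code;
it reuses _pi_name and metadataToString, which are quoted further down this
mail):

    -- As a patch identifier the raw metadata bytes are used unchanged, so
    -- existing repositories keep the same patch identities:
    identityBytes :: PatchInfo -> B.ByteString
    identityBytes = _pi_name            -- no decoding at all

    -- For display or matching the bytes are decoded first: as UTF-8 for
    -- new-style patches, via the locale for old-style ones:
    displayName :: PatchInfo -> String
    displayName pinf = metadataToString pinf (_pi_name pinf)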

>  TODO: understand unpackPSFromUTF8 (was toString)

It takes a UTF-8-encoded byte sequence and produces a String that represents the
same Unicode character sequence.
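
For example (a hypothetical GHCi session, using the definition quoted further
down this mail, unpackPSFromUTF8 = T.unpack . decodeUtf8):

    ghci> unpackPSFromUTF8 (B.pack [0xc3, 0xa9])   -- UTF-8 encoding of U+00E9
    "\233"                                         -- the single character 'é'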

>  TODO: think about implications of this change. [re: Darcs.Utils]

>  For the second
> TODO, I think that's just making askUser return a proper Unicode string

Yes, it's fine. That hack was needed because Darcs.Patch.Info did not handle
Unicode well; now it does, so the hack is no longer necessary.

> Make amend-record store metadata as UTF-8
> -----------------------------------------
> Is this patch a misnomer?  It seems like these PatchInfo modifications are
> quite general and while I could be convinced that they would affect
> amend-record (with no changes to the actual AmendRecord command), perhaps
> the name should have reflected that generality.

The name reflects the process. I did a sort of TDD: I first wrote tests for
amend-record, then started changing things until those tests passed, and then
recorded those changes as "make amend-record store metadata as UTF-8". But
some of those changes are indeed quite general.

> >  just_author :: PatchInfo -> String
> > hunk ./src/Darcs/Patch/Info.hs 109
> > -just_author = BC.unpack . _pi_author
> > +just_author = toString . _pi_author
> 
> > hunk ./src/Darcs/Patch/Info.hs 113
> > -    text (friendly_d $ _pi_date pi) <> text "  " <> packedString (_pi_author pi)
> > - $$ hfn (_pi_name pi)
> > +    text (friendly_d $ _pi_date pi) <> text "  " <> text (pi_author pi)
> > + $$ hfn (pi_name pi)
> >   $$ vcat (map ((text "  " <>) . packedString) (ignore_junk $ _pi_log pi))
> 
> COMMENT: Is there a reason we don't also remove these packedString lines from
> the bit that prints the patch log?

No, there isn't. But it doesn't matter, because as far as I know the Printer
module prints numerical escapes for non-ASCII characters anyway.

> I'm not 100% clear on why we remove these packedString calls for that matter.
> For the interested, this comes from the Printer module:
> 
> -- | 'packedString' builds a 'Doc' from a 'B.ByteString' using 'printable'
> packedString :: B.ByteString -> Doc
>  
> Does the Printer module have to be updated to deal with UTF-8 bytestrings as
> well?

Yes. I am planning to do that in a separate bundle that makes darcs actually
display its Unicode correctly. This bundle only makes darcs store it correctly
in the filesystem.

> Should we have separate types to distinguish between UTF-8 bytestrings
> (used for patch metadata) and unknown-encoding-bytestrings (perhaps used for
> patches themselves)?

That would be nice. There's a compact-string package on hackage that provides
more or less that, but it was unpopular and died for some reason.
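
A lightweight alternative to pulling in a whole package would be a newtype
wrapper (sketch only, not part of this bundle; it reuses the conversion
functions introduced there):

    -- Keep UTF-8-encoded data apart from raw bytes at the type level.
    newtype UTF8ByteString = UTF8ByteString { rawUTF8 :: B.ByteString }

    packUTF8 :: String -> UTF8ByteString
    packUTF8 = UTF8ByteString . packStringToUTF8

    unpackUTF8 :: UTF8ByteString -> String
    unpackUTF8 = unpackPSFromUTF8 . rawUTF8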

> >  pi_name :: PatchInfo -> String
> > -pi_name = BC.unpack . _pi_name
> > +pi_name = toString . _pi_name
> >
> > -pi_author = BC.unpack . _pi_author
> > +pi_author = toString . _pi_author
> >
> > -pi_log = map BC.unpack . ignore_junk . _pi_log
> > +pi_log = map toString . ignore_junk . _pi_log
> >
> >  pi_tag :: PatchInfo -> Maybe String
> >  pi_tag pinf =
> > hunk ./src/Darcs/Patch/Info.hs 158
> >      if l == t
> > -      then Just $ BC.unpack r
> > +      then Just $ toString r
> >        else Nothing
> >      where (l, r) = B.splitAt (B.length t) (_pi_name pinf)
> > hunk ./src/Darcs/Patch/Info.hs 161
> > -          t = BC.pack "TAG "
> > +          t = fromString "TAG "
> 
> More assuming that we have UTF-8 encoded bytestrings in our patchinfo
> (this is also assuming that we do the conversion for old-style patches
> somewhere else)

This code is changed again later on in the bundle.

> >  pi_rename :: PatchInfo -> String -> PatchInfo
> > -pi_rename x n = x { _pi_name = BC.pack n }
> > +pi_rename x n = x { _pi_name = fromString n }
> >
> >  set_pi_date :: String -> PatchInfo -> PatchInfo
> > -set_pi_date date pi = pi { _pi_date = BC.pack date }
> > +set_pi_date date pi = pi { _pi_date = fromString date }
> 
> Also make sure we store PatchInfo as UTF-8 bytestrings
> 
> >  to_xml :: PatchInfo -> Doc
> >  to_xml pi =
> >          text "<patch"
> > -    <+> text "author='" <> escapeXML (just_author pi) <> text "'"
> > -    <+> text "date='" <> escapeXML (BC.unpack $ _pi_date pi) <> text "'"
> > +    <+> text "author='" <> escapeXMLByteString (_pi_author pi) <> text "'"
> > +    <+> text "date='" <> escapeXMLByteString (_pi_date pi) <> text "'"
> >      <+> text "local_date='" <> escapeXML (friendly_d $ _pi_date pi) <> text "'"
> 
> Again: should the local date also be included in this, since it may one
> day go locale-specific?

It may one day go locale-specific, but I think we should do the necessary
plumbing no earlier than that day. For now, the date strings are ASCII and
there's no need to send them through all kinds of Unicode loops.

> > +-- Escape XML characters in a UTF-8 encoded ByteString, and turn it into a Doc.
> > +-- The data will be in the Doc as a bytestring.
> > +escapeXMLByteString :: B.ByteString -> Doc
> > +escapeXMLByteString = packedString . bstrReplace '\'' "&apos;"
> > +                                   . bstrReplace '"'  "&quot;"
> > +                                   . bstrReplace '>'  "&gt;"
> > +                                   . bstrReplace '<'  "&lt;"
> > +                                   . bstrReplace '&'  "&amp;"
> 
> This looks like a direct translation of escapeXML.  Perhaps some sort of
> refactor would be good using a higher-order function?  Or is escapeXML just
> going away?

I hope the latter.

> >  strReplace :: Char -> String -> String -> String
> >  strReplace _ _ [] = []
> > hunk ./src/Darcs/Patch/Info.hs 211
> >    | x == z    = y ++ (strReplace x y zs)
> >    | otherwise = z : (strReplace x y zs)
> >  
> > +bstrReplace :: Char -> String -> B.ByteString -> B.ByteString
> > +bstrReplace c s bs | B.null bs   = B.empty
> > +                   | otherwise   = if BC.head bs == c
> > +                                     then B.append (BC.pack s)
> > +                                                   (bstrReplace c s (B.tail bs))
> > +                                     else B.cons (B.head bs)
> > +                                                 (bstrReplace c s (B.tail bs))
> > +
> 
> Yuck.  I wonder if there are generic functions that can replace both strReplace
> and bstrReplace.
> 
> Also, it looks like bstrReplace could do with some sharing of the head and tail,
> that we could do more to make it look like strReplace.

The main reason it looks different from strReplace is that we can't
pattern-match on the characters of a ByteString.
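
That said, the explicit recursion could probably be avoided altogether with
concatMap (a sketch, not something I've put in the bundle; it assumes
Data.ByteString.Char8 is imported as BC):

    -- Replace every occurrence of a character by a string.
    bstrReplace :: Char -> String -> B.ByteString -> B.ByteString
    bstrReplace c s = BC.concatMap (\x -> if x == c
                                            then BC.pack s
                                            else BC.singleton x)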

> Make _darcs/prefs/author be locale-encoded
> ------------------------------------------
> > conflictor [
> > hunk ./src/Darcs/Arguments.lhs 831
> > -          writeFile (darcsdir++"/prefs/author") add
> > +          writeFile (darcsdir ++ "/prefs/author") $
> > +                    unlines ["# " ++ line | line <- fileHelpAuthor] ++ add
> > ]
> > :
> > hunk ./src/Darcs/Arguments.lhs 831
> > -          writeFile (darcsdir++"/prefs/author") add
> > +          writeLocaleFile (darcsdir++"/prefs/author") add
> 
> Huh, so I wonder if this is a nice effect of the conflictor representation;
> that you can easily read the actual patch as the part that follows the ':'
> 
> Otherwise, this is as you promised in the description to this patch bundle above

Note: perhaps we should document this somewhere. While it is not really for
end-user documentation (the end user should just edit the config file in a text
editor and have it Just Work), it is important for developers to know that our
config files are supposed to be locale-encoded.

> >      case undecodedAuthor of
> > -        Just a  -> Just `fmap` decodeString a
> > +        Just a  -> return (Just (decodeString a))
> >          Nothing -> return Nothing
> 
> Looks like you could just replace that with something like
> 
>    return (decodeString `fmap` undecodedAuthor)

> Stop handling all command line arguments as Unicode text
> --------------------------------------------------------
> > --- | A locale-aware version of getArgs
> > ---   This function returns the command-line arguments, interpreted in the
> > ---   current locale instead of byte-by-byte.
> > -getArgsLocale :: IO [String]
> > -getArgsLocale = do
> > -    byteArgs <- getArgs >>= return . map encodeLatin1
> > -    return (map decodeLocale byteArgs)
> 
> Oh!  This seems to rollback part of your very first patch.  So now we stop
> assuming that all the args are Unicode text, only a small subset of
> them.  Why?  Is this related to filenames?

Yup. Filenames are sequences of bytes on POSIX, and when we treat command-line
arguments as Unicode text, we lose the ability to open files whose names are
invalid in the current locale.
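
A concrete (made-up) example of what goes wrong:

    -- A file name created under a Latin-1 locale: "café" as the bytes
    -- 0x63 0x61 0x66 0xe9, which is not valid UTF-8.
    latin1Name :: B.ByteString
    latin1Name = B.pack [0x63, 0x61, 0x66, 0xe9]

    -- Decoding these bytes as Unicode text either fails or substitutes a
    -- replacement character (depending on the decoder), so re-encoding the
    -- decoded String no longer yields the original bytes, and darcs could
    -- no longer open the file the user named on the command line.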

> Right now the subset consists of --author, --from and -m, i.e. things which
> appear in the patch metadata
> 
> > +from_opt = DarcsArgOption [] ["from"] (Author . decodeString) "EMAIL"
> > +                          "specify email address"
> 
> Well, if we're going to do this, why not --to, --cc and --subject as well?

Because they're not stored as patch metadata. It might be a good idea to
decode them anyway; I believe I once made a patch to Darcs.Email that assumed
that would happen one day.

> Normalize Unicode of the patch metadata for comparison.
> -------------------------------------------------------
> Skimming http://www.unicode.org/reports/tr15/, the context behind this
> patch is that even within Unicode there is often more than one way to
> represent the same character (excuse me if I'm committing a horrible
> abuse of language here).
> 
> For example, the c with a cedilla could be represented either as the
> single 'ç' or as plain old 'c' plus a cedilla diacritic.  There are
> other kinds of equivalences too, for example, the order of diacritics
> for characters that have more than one (eg in Vietnamese).  Luckily,
> Unicode also defines equivalences and normal forms for the above.  In
> this patch, Reinier has chosen the NFC form, which basically gives us
> the composed form (eg. ç) wherever applicable.
> 
> What we want to avoid is the situation where Darcs thinks that two patch
> metadata strings are different merely because they use different
> representations for the same characters.  I'm not sure how this
> situation could arise or why we think it's necessary to protect against
> it, but instinctively it does feel right to want to do this.  This is
> the sort of thing I'd like to pass up to some sort of Unicode guy, but I
> guess John Cowan has approved it...
> 
> COMMENT: Are there any risks of this getting us into trouble for
> old-style patches?  Is it possible that the normalisation process would
> somehow mess them up, for example, by transforming a sequence of bytes
> that just so happens to look like UTF-8-encoded 'c followed by a
> cedilla' into UTF-8 'ç'?

No, there are no such risks.

The normalization is never supposed to touch old-style patches; metadata is
only normalized when it is newly recorded.

> >  This also includes two other changes: the change from the utf8-string
> >  package to the text package for encoding matters
> 
> What's the motivation behind this?
> Also, why do we use the 0.3 version of text and not the latest 0.5?

The motivation is that text promises to supply normalization via the
text-icu package. That didn't work on my Ubuntu Jaunty Jackalope, so I wrote
an FFI wrapper for normalization myself, but I hope we can switch to text-icu
in a year or so.

Besides, text uses UTF-16, which is what ICU's normalization routines expect.
If we kept using utf8-string, we would have to use another package anyway to
convert to UTF-16 (or write our own conversion code, which might have subtle
bugs).

> > +  extra-libraries:  icuuc
> > +
> 
> How portable a change is this?  On my Mac, I saw that I had a libicucore
> and not a libicuuc.  MacPorts says it has a libicu package.  Do we just
> get used to requiring people to install that like they might libz?

ICU is quickly becoming a core library that's present almost everywhere, much
like libz. It is maintained by IBM, as far as I know. I don't expect any
portability problems from it.

> Ah, I see you've also noticed the text-icu package; why doesn't it
> work on major Linux distributions?

Because it requires ICU version 4, but many Linux distros still ship version
3.8.

> > +unpackPSFromUTF8 :: B.ByteString -> String
> > +unpackPSFromUTF8  = T.unpack . decodeUtf8
> > +
> > +packStringToUTF8 :: String -> B.ByteString
> > +packStringToUTF8 = encodeUtf8 . T.pack
> 
> From UTF-8 bytestring to UTF-16 Text to String and back.  Maybe we
> should just be shipping Text around whenever we know we're dealing
> with Unicode instead of UTF-8 bytestrings?

Hmm, yeah, but that would be another big overhaul of the code. Besides, String
is the native Haskell type for text, and we can expect to have to convert our
Texts to Strings anyway to call external code.

> > +-- | Convert a String to a ByteString by encoding it with UTF-8 and normalizing
> > +-- it to Normalized Form C. For the why and how of normalization, see
> > +-- http://www.unicode.org/reports/tr15/ .
> > +packStringToUTF8NFC :: String -> B.ByteString
> > +packStringToUTF8NFC = encodeUtf8 . normalizeText . T.pack
> 
> Is there a reason to prefer NFC to NFD?  Is the choice entirely
> arbitrary so long as we're using a normal form?

I believe that, Unicode-technically, the choice is arbitrary. But NFC
represents accented characters more compactly, so I chose that. Also, most
UTF-8 text in European languages in the wild already contains accented
characters in 'composed' form, so with NFC there will presumably be fewer
cases where normalization actually has to change the input string.
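
To make the compactness point concrete (an illustration, code points only):

    -- The same character in both normal forms:
    nfc, nfd :: String
    nfc = "\x00e7"          -- 'ç' as one precomposed code point (2 bytes in UTF-8)
    nfd = "\x0063\x0327"    -- 'c' plus COMBINING CEDILLA (3 bytes in UTF-8)

    -- packStringToUTF8NFC maps both to the same 2-byte sequence 0xC3 0xA7.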

> > +foreign import ccall "unicode/unorm.h unorm_normalize_3_8" raw_normalize :: Ptr Word16 -> Int32 -> CInt -> Int32 -> Ptr Word16 -> Int32 -> Ptr CInt -> IO Int32
> 
> So for the interested, unorm_normalize has this documentation.
> 
> int32_t unorm_normalize         (       const UChar *            source,
>                 int32_t         sourceLength,
>                 UNormalizationMode      mode,
>                 int32_t         options,
>                 UChar *         result,
>                 int32_t         resultLength,
>                 UErrorCode *    status   
>         )       
> 
> COMMENT: Do we need to care about the potential difference between
> UTF-16 and ICU's UChar type (which is also 16 bits wide)?

No, the ICU documentation explicitly states that its UChar is 16 bits wide.

> > +-- | Normalize a Text according to Normal Form C (NFC).
> > +--   This home grown binding of ICU is inefficient, so do not use it for longer
> > +--   strings. Hopefully it can be removed in the 2.5 release or so when text-icu 
> > +--   works on the major Linux distributions.
> > +normalizeText :: Text -> Text
> > +normalizeText s = unsafePerformIO $ T.useAsPtr s $ \sptr slen ->
> > +                      allocaBytes bufSizeBytes $ \tptr ->
> > +                          let len = fromIntegral slen :: Int32
> > +                          in do resultLength <- handleError $ \errptr ->
> > +                                                    raw_normalize sptr len normalizationType unicode32 tptr (fromIntegral bufSizeBytes) errptr
> > +                                T.fromPtr tptr (fromIntegral resultLength)
> 
> Otherwise, I'm reading this as
> - create a copy of the Text in the form of a C array
> - create a new array that's 3 times the size of the original UTF-16 array
>   (notice the *2 below because of 16 as opposed to 8 bits)
> - pass both to unorm_normalize along with size information.
> 
> > +{-|
> > + Provides simple (i.e. abort-if-anything-wrong) error handling for ICU
> > + functions.
> > +
> > + Takes as an argument a function that writes an ICU error code to a certain
> > + memory address (like most ICU4C functions do).
> > +
> > + This function runs the given function, giving it a memory address to write the
> > + error code to. When the given function indicates an error, it aborts the
> > + program. Otherwise it just returns the result.
> > +-}
> > +handleError :: (Ptr CInt -> IO a) -> IO a
> > +handleError f = alloca $ \errptr ->
> > +                    do poke errptr 0
> > +                       result <- f errptr
> > +                       errorCode <- peek errptr
> > +                       when (errorCode > 0) (error (errMsg ++ show errorCode))
> > +                       return result
> > +  where errMsg = "darcs ByteStringUtils: error returned by ICU4C function: "
> 
> If anything goes wrong during normalisation, our head explodes, which looks
> fair enough.
> 
> Note that I tend to get nervous and want to run to a grown-up when I see this
> low-level stuff.  I think what I'm reading just says "allocate space for a C
> int, pass f a pointer to that int, and explode if we read back a non-zero value
> for that int"

Let this grown-up assure you that that is a correct interpretation of what this is
supposed to do :-)
> 
> > hunk ./src/Darcs/Email.hs 7
> >        && ord c < 128                    = c2bs c
> > -    | otherwise                         = B.concat (map qbyte (UTF8.encode [c]))
> > +    | otherwise                         = B.concat
> > +                                            (map qbyte
> > +                                              (B.unpack
> > +                                                (packStringToUTF8 [c])))
> 
> This looks like merely replacing the utf8-string stuff with the equivalent from
> Text
> 
> >  patchinfo :: String -> String -> String -> [String] -> IO PatchInfo
> >  patchinfo date name author log =
> > -    add_junk $ PatchInfo { _pi_date     = fromString date
> > -                         , _pi_name     = fromString name
> > -                         , _pi_author   = fromString author
> > -                         , _pi_log      = map fromString log
> > +    add_junk $ PatchInfo { _pi_date     = BC.pack date
> > +                         , _pi_name     = packStringToUTF8NFC name
> > +                         , _pi_author   = packStringToUTF8NFC author
> > +                         , _pi_log      = map packStringToUTF8NFC log
> >                           , is_inverted  = False }
> 
> We systematically normalise patch metadata to NFC
> 
> COMMENT: I wonder if there is any risk of manipulating the PatchInfo
> using functions that don't packToUTF8NFC

Yes, there is. I have contained this risk by (a) warning in the PatchInfo
documentation that you shouldn't use the member fields and (b) writing tests
for almost every imaginable way to create or change patch metadata.

> >  -- | add_junk adds a line that contains a random number to make the patch
> 
> Sounds like we also need to update this haddock
> 
> > hunk ./src/Darcs/Patch/Info.hs 87
> > -       return $ pinf { _pi_log = BC.pack (head ignored++showHex x ""++" UTF-8"):
> > +       return $ pinf { _pi_log = BC.pack (head ignored++showHex x ""):
> > +                                 BC.pack (head ignored++ "UTF-8"):
> >                                   _pi_log pinf }
> 
> Oh! I completely missed that you had written this all on one line the
> first time.  Yes, I think I like the new way better.
> 
> [skipped a bunch of toString => metadataToString conversions]
> 
> > +-- | Is the metadata in this PatchInfo UTF-8-encoded?
> > +isUTF8 :: PatchInfo -> Bool
> > +isUTF8 = any (==(BC.pack "Ignore-this: UTF-8")) . _pi_log
> 
> No surprises.  A PatchInfo is new-style if it has an Ignore-this: line.
> Note the use of _pi_log here instead of pi_log (which does friendly
> things like stripping out the Ignore-this lines)
> 
> > +-- | Convert a metadata ByteString to a string, with the exact way of conversion
> > +-- depending on the version of darcs that created the PatchInfo. In newer darcs
> > +-- the metadata are UTF-8, older darcs was not encoding-aware.
> > +metadataToString :: PatchInfo -> B.ByteString -> String
> > +metadataToString pinf | isUTF8 pinf = unpackPSFromUTF8
> > +                      | otherwise   = decodeLocale
> 
> In contrast, we store the underlying metadata as NFC-normalised UTF-8
> bytestrings.
> 
> I wonder if we should have a metadataFromString :: String -> B.ByteString
> 
> Also it does sound like I'd need to use Text to represent the metadata
> if we're
> going to be doing that anyway.
> 
> Add C wrappers for ICU functions to work around changing symbol names
> ---------------------------------------------------------------------
> I don't really understand this patch
> 
> > -foreign import ccall "unicode/unorm.h unorm_normalize_3_8" raw_normalize :: Ptr Word16 -> Int32 -> CInt -> Int32 -> Ptr Word16 -> Int32 -> Ptr CInt -> IO Int32
> > +foreign import ccall "fpstring.h normalize" raw_normalize :: Ptr Word16 -> Int32 -> CInt -> Int32 -> Ptr Word16 -> Int32 -> Ptr CInt -> IO Int32
> 
> Specifically, why not just import unorm_normalize?  Perhaps because it's
> a just macro and we can't do that with the FFI?

Something like that. It's not really a macro, but somehow the "symbol name"
(the identifier for the machine code of the unorm_normalize function in the
shared library file) is not unorm_normalize, but unorm_normalize_VERSION. You
can see that above: I imported "unorm_normalize_3_8" instead of just
"unorm_normalize". To make the code work on ICU versions other than 3.8, I
wrote a C wrapper with a symbol name that *we* control. And unlike GHC, the C
compiler knows how to link a call to "unorm_normalize".

If anyone knows the actual mechanism ICU uses to encode the version in the
symbol name, I'd be glad to hear it. I also thought about macros, but I
couldn't find any.

> >  normalizeText :: Text -> Text
> > -normalizeText s = unsafePerformIO $ T.useAsPtr s $ \sptr slen ->
> > +normalizeText s = unsafePerformIO $ T.useAsPtr s $ \sptr slen -> do
> 
> Superfluous change?

Perhaps so.

> > +/* Unicode normalization: wrap unorm_normalize because ICU  C encodes the library
> > + * version in the symbol name. */
> > +int32_t normalize(UChar *source, int32_t sourceLength, UNormalizationMode mode,
> > +                  int32_t options, UChar *result, int32_t resultLength,
> > +                  UErrorCode *status)
> > +{
> > +    return unorm_normalize(source, sourceLength, mode, options,
> > +                           result, resultLength, status);
> > +}
> > +
> > +/* Check Unicode normalization. Again, wrap the function because ICU's
> > + * symbol names are not stable across ICU versions. */
> 
> Is there a good reason for that?
> Is there any reason to think it would be unwise to defeat this?

I don't know the reason. But C code is portable across ICU versions, so why
shouldn't our Haskell code be?

> > +UBool isNormalized(const UChar *src, int32_t srcLength, UNormalizationMode mode,
> > +                   UErrorCode *pErrorCode)
> > +{
> > +    return unorm_isNormalized(src, srcLength, mode, pErrorCode);
> > +}
> 
> This looks like it's just used by tests.
> 
> Add tests for function of UTF-8 marker line and for NFC normalization
> ---------------------------------------------------------------------
> I think I'd be happier if the modifications to ByteStringUtils, etc
> were in a separate patch.
> 
> >                            in do resultLength <- handleError $ \errptr ->
> > -                                                    raw_normalize sptr len normalizationType unicode32 tptr (fromIntegral bufSizeBytes) errptr
> > +                                                    raw_normalize sptr len nfcType 0 tptr (fromIntegral bufSizeBytes) errptr
> 
> Is there a reason we replace 0x20 by 0?

Yes. Yes Yes YES! I spent two weeks debugging that. 0x20 tells ICU to do
normalization compatible with older versions of Unicode. I thought that was a
good idea: it would make sure that newer versions of darcs always normalize
the way older versions did.

However, isNormalized does not take this into account, and flags an error when
you pass it a backwards-compatible normalization that differs from the
normalization that the current version of Unicode uses. Such a normalization
difference is extremely rare, so I had a QuickCheck test that failed only once
in 2000 tries or so :-(.
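
For the record, names for the two magic numbers involved (this is my reading
of ICU's unorm.h, so treat the exact identifier as an assumption):

    -- Assumed from ICU's unorm.h: 0x20 is the option bit asking for
    -- Unicode-3.2-compatible normalization; 0 means "no options", i.e.
    -- normalize according to the current Unicode version.
    unormUnicode_3_2, unormNoOptions :: Int32
    unormUnicode_3_2 = 0x20
    unormNoOptions   = 0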

> > +-- The members with names that start with '_' are not supposed to be used
> > +-- directly in code that does not care how the patch info is stored.
> 
> Useful haddock.  Are they even meant to be exported?

Yes, for the tests.

> > +instance Arbitrary UnicodeString where
> > +    -- 0x10ffff is the highest Unicode code point ; 0xd800 - 0xdfff are
> > +    -- surrogates.
> > +    arbitrary = UnicodeString `fmap` listOf (oneof [choose ('\0', '\xd799')
> > +                                                   ,choose ('\xe000', '\x10ffff')])
> 
> http://www.unicode.org/glossary/#surrogate_code_point
> http://en.wikipedia.org/wiki/Mapping_of_Unicode_characters#Surrogates
> 
> OK I'm just trusting this to mean that in the test we never bother generating
> characters that are not Unicode characters per se (eg. UTF-16 artefacts don't
> count)

True. Surrogates make Unicode functions insert all sorts of error-replacement
characters (the things that show up as question-mark diamonds in Firefox),
which makes tests fail.

> > +instance Arbitrary MyPatchInfo where
> > +    arbitrary = do n <- asString `fmap` arbitrary
> > +                   d <- arbitrary
> > +                   a <- asString `fmap` arbitrary
> > +                   l <- (lines . asString) `fmap` arbitrary
> > +                   let arbPatchInfo = unsafePerformIO $ patchinfo n d a l
> > +                   MyPatchInfo `fmap`
> > +                       (oneof [return arbPatchInfo
> > +                              ,return (deleteUTF8Marker arbPatchInfo)])
> > +    shrink mpi = flip withMyPatchInfo mpi $ \pi -> do
> > +        sn <- shrink (pi_name pi)
> > +        sa <- shrink (pi_author pi)
> > +        sl <- shrink (filter (not . isPrefixOf "Ignore-this:") (pi_log pi))
> > +        return (MyPatchInfo (unsafePerformIO $
> > +                                 patchinfo sn (BC.unpack (_pi_date pi)) sa sl))
> 
> Randomly generates a patchinfo either for a new-style or old-style patch
> 
> > +-- | Test that metadata in patches are decoded as UTF-8 or locale depending on
> > +-- the relevant 'Ignore-this' line in the log.
> > +metadataStringTest :: Test
> > +metadataStringTest = testProperty "Testing patch metadata decoding" $
> > +    withMyPatchInfo $
> > +        \patchInfo -> classify (isUTF8 patchInfo) "UTF-8-encoded" $
> > +            let decoder | isUTF8 patchInfo = unpackPSFromUTF8
> > +                        | otherwise        = decodeLocale
> > +            in (  decoder (_pi_author patchInfo)  == pi_author patchInfo
> > +               && decoder (_pi_name   patchInfo)  == pi_name   patchInfo
> > +               && map decoder (_pi_log patchInfo) `superset` pi_log patchInfo)
> 
> Is this test useful if your locale uses a UTF-8 encoding?

No, it's not. But I have an nl_NL@euro locale that uses the latin9 encoding to
test this as well.

> > +packUnpackTest :: Test
> > +packUnpackTest = testProperty "Testing UTF-8 packing and unpacking" $
> > +    \uString -> asString uString == (unpackPSFromUTF8 . packStringToUTF8) (asString uString)
> 
> This looks like it's more testing the low-level ByteStringUtils
> 
> > +superset :: (Eq a, Ord a) => [a] -> [a] -> Bool
> > +superset a b = sorted_superset (sort a) (sort b)
> > +  where sorted_superset (x:xs) (y:ys) | x == y = sorted_superset xs ys
> > +                                      | x <  y = sorted_superset xs (y:ys)
> > +                                      | y <  x = False
> > +        sorted_superset []     (_:_)           = False
> > +        sorted_superset _      []              = True
> 
> I suppose we could have a more conservative test instead, seeing whether
> b can be split into some prefix and suffix of a
> 
> > +-- | Test that metadata strings are normalized to NFC
> > +metadataNormalizationTest :: Test
> > +metadataNormalizationTest = testProperty "Testing metadata normalization" $
> > +    withMyPatchInfo $
> > +        \patchInfo -> isUTF8 patchInfo ==>
> > +                          isNFC (pi_author patchInfo)
> > +                          && isNFC (pi_name patchInfo)
> > +                          && all isNFC (pi_log patchInfo)
> 
> Looks good
> 
> > +-- | Generate an arbitrary list of at least one element
> >  unempty :: Arbitrary a => Gen [a]
> >  unempty = do
> > hunk ./src/Darcs/Test/Patch/Test.hs 162
> > +  a <- arbitrary
> >    as <- arbitrary
> > hunk ./src/Darcs/Test/Patch/Test.hs 164
> > -  case as of
> > -    [] -> unempty
> > -    _ -> return as
> > +  return (a:as)
> 
> Seems like just a minor refactor, replacing a function that generates
> a non-empty list with a cleaner variation

Wrong. The previous version could hang when the QuickCheck size parameter was
so low that QuickCheck would only generate empty lists. So this fixes a nasty
bug.
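
(For what it's worth, QuickCheck 2's listOf1 expresses the same thing
directly; a sketch of the equivalent definition:)

    unempty :: Arbitrary a => Gen [a]
    unempty = listOf1 arbitrary    -- always generates at least one element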

Bye,
Reinier