[darcs-users] [patch37] Store textual patch metadata encoded in UTF-8

Eric Kow kowey at darcs.net
Mon Nov 9 23:13:42 UTC 2009


Hi Reinier,

On Sun, Nov 01, 2009 at 17:55:50 +0000, Reinier Lamers wrote:
> I attached a diff between the starting point and the end result of these
> patches, as they contain some hunks that are reverted in later patches.

I'm afraid I'm not really very useful for this patch review, but perhaps
I could get the ball rolling with some initial comments.

Also: I only made it past the first patch before I got tired, but
hopefully this is of some use...

Make record store patch metadata in UTF-8
-----------------------------------------
> +-- | A locale-aware version of getArgs
> +--   This function returns the command-line arguments, interpreted in the
> +--   current locale instead of byte-by-byte.
> +getArgsLocale :: IO [String]
> +getArgsLocale = do
> +    byteArgs <- getArgs >>= return . map encodeLatin1
> +    sequence (map decodeLocale byteArgs)

Will this break if getArgs changes to use the current locale?
Will that change happen?

> +-- | Decode a ByteString to a String according to the current locale
> +decodeLocale :: B.ByteString -> IO String
> +decodeLocale = runInputT defaultSettings . decode

> +-- | Encode a String to a ByteString according to the current locale
> +encodeLocale :: String -> IO B.ByteString
> +encodeLocale = runInputT defaultSettings . encode

Here we're using Haskeline, perhaps because Haskeline does all the hard
work of portably figuring out the current locale's encoding for us?

Also, I wonder if we should use type synonyms to distinguish between
strings of Unicode Chars and strings of padded bytes (8 bits of stuff
and 24 bits of zero).  I'm not sure if I'm making any sense here, just
thinking that what IO in GHC < 6.12 gives us is the latter.

> +-- | Encode a String to a ByteString with latin1 (i.e., the values of the
> +-- characters become the values of the bytes; if a character value is greater
> +-- than 255, its byte becomes the character value modulo 256)
> +encodeLatin1 :: String -> B.ByteString
> +encodeLatin1 = B.pack . (map (fromIntegral . ord))

Would it be better to explode if the character value is greater than 255?
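A "explode loudly" variant might look like the sketch below; encodeLatin1' is a hypothetical name, and whether erroring out is actually the right policy for darcs is the open question above.

```haskell
import qualified Data.ByteString as B
import Data.Char (ord)
import Data.Word (Word8)

-- Hypothetical stricter variant: fail loudly on non-latin1 input
-- instead of silently wrapping modulo 256. Not in the patch.
encodeLatin1' :: String -> B.ByteString
encodeLatin1' = B.pack . map toByte
  where
    toByte :: Char -> Word8
    toByte c
      | ord c < 256 = fromIntegral (ord c)
      | otherwise   = error ("encodeLatin1': non-latin1 character " ++ show c)

main :: IO ()
main = print (B.unpack (encodeLatin1' "abc"))  -- prints [97,98,99]
```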

> +-- | Take a @String@ that represents byte values and re-decode it according to
> +-- the current locale.
> +decodeString :: String -> IO String
> +decodeString = decodeLocale . encodeLatin1

Should this be something like decodeStringLocale or is it always clear
from context?

> +-- | Convert a bytestring representing a text from UTF-8 to the current locale
> +utf8ToLocale :: B.ByteString -> IO B.ByteString
> +utf8ToLocale bs = encodeLocale string
> +  where string = toString bs

Looks like you could do a minor point-free refactor here.
TODO: understand unpackPSFromUTF8 (was toString)
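The point-free version would just be a composition. A sketch, with stand-ins that mimic the real encodeLocale (Haskeline-backed) and toString (utf8-string) only enough to show the shape:

```haskell
import qualified Data.ByteString.Char8 as BC

-- Stand-ins, NOT the real implementations: the patch's toString comes
-- from utf8-string and encodeLocale goes through Haskeline.
toString :: BC.ByteString -> String
toString = BC.unpack

encodeLocale :: String -> IO BC.ByteString
encodeLocale = return . BC.pack

-- Original:   utf8ToLocale bs = encodeLocale string where string = toString bs
-- Point-free:
utf8ToLocale :: BC.ByteString -> IO BC.ByteString
utf8ToLocale = encodeLocale . toString

main :: IO ()
main = utf8ToLocale (BC.pack "hello") >>= BC.putStrLn  -- prints hello
```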

>  get_easy_author :: IO (Maybe String)
> -get_easy_author = firstJustIO [ firstNotBlank `fmap` get_preflist "author",
> -                                firstNotBlank `fmap` get_global "author",
> -                                maybeGetEnv "DARCS_EMAIL",
> -                                maybeGetEnv "EMAIL" ]
> +get_easy_author = do
> +    undecodedAuthor <-
> +      firstJustIO [ firstNotBlank `fmap` get_preflist "author",
> +                    firstNotBlank `fmap` get_global "author",
> +                    maybeGetEnv "DARCS_EMAIL",
> +                    maybeGetEnv "EMAIL" ]
> +    case undecodedAuthor of
> +        Just a  -> Just `fmap` decodeString a
> +        Nothing -> return Nothing

Does undecodedAuthor mean something different than encodedAuthor?

I'm guessing here we're expecting get_preflist, get_global and
maybeGetEnv to return locale-encoded padded bytes

> +import Data.Char ( ord )
> +                          putStrLn ("Patch name as received from getLog: " ++ show (map ord name))

Do we need a nice abstract function for this?
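If we did abstract it, something like the helper below would do; debugShowOrds is a hypothetical name, not something the patch defines.

```haskell
import Data.Char (ord)

-- Hypothetical debug helper: render a string's code points with a label,
-- so call sites don't each reimplement the show/map ord dance.
debugShowOrds :: String -> String -> String
debugShowOrds label s = label ++ ": " ++ show (map ord s)

main :: IO ()
main = putStrLn (debugShowOrds "Patch name as received from getLog" "ab")
```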

> -                 mlp <- lines `fmap` readBinFile f `catch` (\_ -> return [])
> +                 mlp <- lines `fmap` readLocaleFile f `catch` (\_ -> return [])

> -                                  writeBinFile logf $ unlines $ p : default_log
> +                                  -- TODO: make sure encoding used for logf is the same everywhere
> +                                  -- probably should be locale because the editor will assume it
> +                                  writeLocaleFile logf $ unlines $ p : default_log

> -              do t <- (lines.filter (/='\r')) `fmap` readBinFile f
> +              do t <- (lines.filter (/='\r')) `fmap` readLocaleFile f
>                   case t of [] -> return (oldname, [], Just f)

> -                          _ | null (lines fc) -> hPutStrLn h oldname
> -                            | last fc /= '\n' -> hPutStrLn h ""
> +                          _ | null (lines fc) -> encodeLocale (oldname ++ "\n") >>= B.hPut h
> +                            | last fc /= '\n' -> encodeLocale "\n"              >>= B.hPut h

Are all the 8-bit encodings in use ASCII compatible?

> hunk ./src/Darcs/Lock.hs 24
> hunk ./src/Darcs/Lock.hs 293
>  readBinFile :: FilePathLike p => p -> IO String
>  readBinFile = fmap BC.unpack . B.readFile . toFilePath
>  
> +-- | Reads a file. Differs from readBinFile in that it interprets the file in
> +--   the current locale instead of as ISO-8859-1.
> +readLocaleFile :: FilePathLike p => p -> IO String
> +readLocaleFile f = decodeLocale =<< B.readFile (toFilePath f)

Good clarification

> +-- | Writes a file. Differs from writeBinFile in that it writes the string
> +--   encoded with the current locale instead of what GHC thinks is right.
> +writeLocaleFile :: FilePathLike p => p -> String -> IO ()
> +writeLocaleFile f s = writeToFile f $ \h -> encodeLocale s >>= B.hPut h

Why the difference between "what GHC thinks is right" here and
ISO-8859-1 above?

> -       return $ pinf { _pi_log = BC.pack (head ignored++showHex x ""):
> +       return $ pinf { _pi_log = BC.pack (head ignored++showHex x ""++" UTF-8"):
>                                   _pi_log pinf }

Thanks for the Ignore-this: mechanism, David.  It may be ugly, but it works!

> -askUser :: String -> IO String
> +-- | Ask the user for a line of input.
> +askUser :: String    -- ^ The prompt to display
> +        -> IO String -- ^ The string the user entered.
>  askUser prompt = withoutProgress $ runInputT defaultSettings $
>                      getInputLine prompt
>                          >>= maybe (error "askUser: unexpected end of input") return
> hunk ./src/Darcs/Utils.hs 159
> -            -- Return the input as encoded, 8-bit Chars (same as the
> -            -- non-Haskeline backend).
> -                        >>= fmap B.unpack . encode

TODO: think about implications of this change.
Will this cause any problems for http://bugs.darcs.net/issue1292 ?

>    handleJust assertions bug $ do
> -  argv <- getArgs
> +  argv <- getArgsLocale

Probably the only place we'll use getArgsLocale :-)

----------------------------------------------------------------------------
Add test for UTF-8 metadata
Add tests to check that amend-record stores metadata in UTF-8
Make UTF-8 test a bit more compact
Add tests for rollback's storing metadata as UTF-8
Test that patches with non-latin1 metadata can be safely moved between repos
----------------------------------------------------------------------------

Trent: could you comment on the utf8.sh test?  If so, I request that it
be bubbled up near the top of your priority list.  (By rights, I should
really be reading the tests too because they would give a better
understanding of the code and the pitfalls, but my excuse is that this
is just a first pass to get an idea what we're up against)
-- 
Eric Kow <http://www.nltg.brighton.ac.uk/home/Eric.Kow>
PGP Key ID: 08AC04F9