[darcs-devel] [issue2648] `darcs convert import` double-encodes cyrillic characters in UTF-8 input stream

Ben Franksen bugs at darcs.net
Tue Jul 14 16:33:58 UTC 2020


Ben Franksen <ben.franksen at online.de> added the comment:

Attached is a patch that hopefully fixes all the encoding issues in
'darcs convert import', including parsing of quoted file paths with
bytes encoded using C's backslash-octal-number notation. (It replaces
and extends Ganesh's quick fix, so please obliterate that before applying.)
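
To illustrate the two problems outside of darcs, here is a minimal
sketch that uses only the bytestring and text packages. The names
unescapeC, spanMax and demo are made up for this mail and are not part
of the patch. Unlike the attoparsec parser in the patch, the sketch
reads at most three octal digits per escape (my reading of the C
notation), and it decodes with decodeUtf8 rather than decodeLocale, so
it assumes the paths really are UTF-8.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Data.Word (Word8)

-- | Undo C-style escapes in the body of a quoted path (the text between
-- the double quotes). Hypothetical helper, not taken from darcs.
unescapeC :: B.ByteString -> B.ByteString
unescapeC = B.pack . go . B.unpack
  where
    go :: [Word8] -> [Word8]
    go [] = []
    go (92 : rest) = escaped rest        -- 92 is the backslash
    go (w : rest) = w : go rest

    -- Called with the bytes that follow a backslash.
    escaped :: [Word8] -> [Word8]
    escaped ws
      | not (null digits) = octal digits : go rest
      where (digits, rest) = spanMax 3 isOct ws
    escaped (110 : rest) = 10 : go rest  -- \n
    escaped (114 : rest) = 13 : go rest  -- \r
    escaped (w : rest) = w : go rest     -- \" and \\ stand for themselves
    escaped [] = []                      -- lone trailing backslash is dropped

    isOct :: Word8 -> Bool
    isOct w = w >= 48 && w <= 55         -- '0' .. '7'

    -- Accumulate in Int, then truncate to a single byte.
    octal :: [Word8] -> Word8
    octal = fromIntegral . foldl step (0 :: Int)
      where step acc d = acc * 8 + (fromIntegral d - 48)

-- | Like 'span', but looks at no more than n leading elements.
spanMax :: Int -> (a -> Bool) -> [a] -> ([a], [a])
spanMax n p xs = (ds, drop (length ds) xs)
  where ds = takeWhile p (take n xs)

-- | Contrast the two ways of turning the unescaped bytes into a String.
demo :: IO ()
demo = do
  let raw = unescapeC (BC.pack "\\320\\277\\321\\200.txt")
  -- Char8 unpack yields one Char per byte, i.e. the input read as Latin-1.
  print (BC.unpack raw)
  -- Decoding as UTF-8 yields one Char per code point.
  print (T.unpack (TE.decodeUtf8 raw))
```

The Char8 result has one Char per byte; writing such a String out as
UTF-8 later encodes every one of those bytes again, which is the double
encoding reported in this issue.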

I tested this by creating my own git repo with lots of funny characters
in file paths, metadata and file content, and then manually inspecting
the resulting darcs repo (using darcs log) as well as checking the round
trip via 'darcs convert export' and comparing the output of 'git log'.

A test case from your side would still be appreciated, ideally as a
test script.

__________________________________
Darcs bug tracker <bugs at darcs.net>
<http://bugs.darcs.net/issue2648>
__________________________________
-------------- next part --------------
1 patch for repository /home/ben/src/darcs/clean:

patch 9778b53871869e5103be02b33a61a244d345303b
Author: Ben Franksen <ben.franksen at online.de>
Date:   Tue Jul 14 18:27:30 CEST 2020
  * convert import: fix meta data and filepath encoding 
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256


New patches:

[convert import: fix meta data and filepath encoding 
Ben Franksen <ben.franksen at online.de>**20200714162730
 Ignore-this: 2f8174c14f7615d952d6b71a821042ec4bb75cd7ced3b81c5b060102db8d01799d373efa7722a8ed
] hunk ./src/Darcs/UI/Commands/Convert/Import.hs 25
- -import Control.Applicative ((<|>))
+import Control.Applicative ((<|>),many)
hunk ./src/Darcs/UI/Commands/Convert/Import.hs 38
+import Data.Word (Word8)
hunk ./src/Darcs/UI/Commands/Convert/Import.hs 115
- -import Darcs.Util.ByteString (decodeLocale)
+import Darcs.Util.ByteString (decodeLocale, unpackPSFromUTF8)
hunk ./src/Darcs/UI/Commands/Convert/Import.hs 268
- -          let (name, log) = case BC.unpack message of
+          let (name, log) = case unpackPSFromUTF8 message of
hunk ./src/Darcs/UI/Commands/Convert/Import.hs 271
- -              (author'', date'') = span (/='>') $ BC.unpack author
+              (author'', date'') = span (/='>') $ unpackPSFromUTF8 author
hunk ./src/Darcs/UI/Commands/Convert/Import.hs 350
- -                                        BC.unpack branch
+                                        decodeLocale branch
hunk ./src/Darcs/UI/Commands/Convert/Import.hs 358
- -          liftIO $ putStrLn $ "WARNING: Ignoring gitlink " ++ BC.unpack link
+          liftIO $ putStrLn $ "WARNING: Ignoring gitlink " ++ decodeLocale link
hunk ./src/Darcs/UI/Commands/Convert/Import.hs 363
- -            liftIO $ putStrLn ("Tagging branch: " ++ BC.unpack pbranch)
+            liftIO $ putStrLn ("Tagging branch: " ++ decodeLocale pbranch)
hunk ./src/Darcs/UI/Commands/Convert/Import.hs 370
- -          TM.copy (markpath m) (floatPath $ BC.unpack path)
+          TM.copy (markpath m) (decodePath path)
hunk ./src/Darcs/UI/Commands/Convert/Import.hs 374
- -          TM.writeFile (floatPath $ BC.unpack path) (BLC.fromChunks [bits])
+          TM.writeFile (decodePath path) (BLC.fromChunks [bits])
hunk ./src/Darcs/UI/Commands/Convert/Import.hs 378
- -          let floatedPath = floatPath $ BC.unpack path
+          let floatedPath = decodePath path
hunk ./src/Darcs/UI/Commands/Convert/Import.hs 391
- -            TM.copy (floatPath $ BC.unpack from) (floatPath $ BC.unpack to)
+            TM.copy (decodePath from) (decodePath to)
hunk ./src/Darcs/UI/Commands/Convert/Import.hs 398
- -          let uFrom = floatPath $ BC.unpack from
- -              uTo = floatPath $ BC.unpack to
+          let uFrom = decodePath from
+              uTo = decodePath to
hunk ./src/Darcs/UI/Commands/Convert/Import.hs 446
- -              Just n' -> fail $ "FATAL: Mark already exists: " ++ BC.unpack n'
+              Just n' -> fail $ "FATAL: Mark already exists: " ++ decodeLocale n'
hunk ./src/Darcs/UI/Commands/Convert/Import.hs 474
- -                            floatUnpack = floatPath . BC.unpack
hunk ./src/Darcs/UI/Commands/Convert/Import.hs 475
- -                                TM.fileExists $ floatUnpack l
+                                TM.fileExists $ decodePath l
hunk ./src/Darcs/UI/Commands/Convert/Import.hs 581
- -          -- Take until a non-escaped " character.
- -          name <- A.scan Nothing
- -            (\previous char -> if char == '"' && previous /= Just '\\'
- -               then Nothing else Just (Just char))
+          bytes <- many (p_escaped <|> p_unescaped)
hunk ./src/Darcs/UI/Commands/Convert/Import.hs 583
- -          return $ unescape name
- -
+          return $ B.concat bytes
+        p_unescaped = A.takeWhile1 (\c->c/='"' && c/='\\')
+        p_escaped = do
+          _ <- A.char '\\'
+          p_escaped_octal <|> p_escaped_char
+        p_escaped_octal = do
+          let octals :: [Char]
+              octals = "01234567"
+          s <- A.takeWhile1 (`elem` octals)
+          let x :: Word8
+              x = read ("0o" ++ BC.unpack s)
+          return $ B.singleton $ fromIntegral x
+        p_escaped_char =
+          fmap BC.singleton $
+          '\r' <$ A.char 'r' <|> '\n' <$ A.char 'n' <|> A.char '"' <|> A.char '\\'
hunk ./src/Darcs/UI/Commands/Convert/Import.hs 611
- -               liftIO $ putStrLn $ "=== chunk ===\n" ++ BC.unpack chunk ++ "\n=== end chunk ===="
+               liftIO $ putStrLn $ "=== chunk ===\n" ++ decodeLocale chunk ++ "\n=== end chunk ===="
hunk ./src/Darcs/UI/Commands/Convert/Import.hs 614
- -
- --- |unescape turns \r \n \" \\ into their unescaped form, leaving any
- --- other \-preceeded characters as they are.
- -unescape :: BC.ByteString -> BC.ByteString
- -unescape cs = case BC.uncons cs of
- -  Nothing -> BC.empty
- -  Just (c', cs') -> if c' == '\\'
- -    then case BC.uncons cs' of
- -      Nothing -> BC.empty
- -      Just (c'', cs'') -> let unescapedC = case c'' of
- -                                'r'  -> '\r'
- -                                'n'  -> '\n'
- -                                '"'  -> '"'
- -                                '\\' -> '\\'
- -                                x    -> x in
- -        BC.cons unescapedC $ unescape cs''
- -    else BC.cons c' $ unescape cs'
+decodePath :: BC.ByteString -> AnchoredPath
+decodePath = floatPath . decodeLocale

Context:

[TAG 2.15.2
Ganesh Sittampalam <ganesh at earth.li>**20190916154842
 Ignore-this: 49a3b59b9fd79ac55ad4e54388f88b77
] 
Patch bundle hash:
7c97045cd536c1647c3a0210ab03e8edda2b9b14
-----BEGIN PGP SIGNATURE-----

iHUEAREIAB0WIQS1sLTEOCbYp4iyltnTbkUxbljMlwUCXw3dTAAKCRDTbkUxbljM
l1qWAQDAEveiZ2byJQ7qrk8yVApZxK6BEQmjnGbNTE8ID3YNcwEAnli2TtMwGZFY
gAPTA5hSyGUbhfD6e/WWWieHdYKUtWw=
=wMQo
-----END PGP SIGNATURE-----