[darcs-devel] [issue2648] `darcs convert import` double-encodes cyrillic characters in UTF-8 input stream
bugs at darcs.net
Tue Jul 14 13:22:32 UTC 2020
Ben Franksen <ben.franksen at online.de> added the comment:
> I've verified and guarantee, that provided cyrillic_input_stream file
> contains common `git fast-export` UTF-8 encoded stream from valid git
> repo, containing exactly one patch named 'Южноэфиопский грач увёл
> мышь за хобот на съезд ящериц' with exactly one file with cyrillic
> filename 'Панграмма.txt' with exactly one line of text 'Широкая
> электрификация южных губерний даст мощный толчок подъёму сельского
> хозяйства' and is valid for recreating exactly the same git repo by
> piping it on `git fast-import`, which I twice tested manually (on
> 2020-06-29, when creating the patch, and now).
Thanks. I think Ganesh's fix for the metadata is okay. However.
Have you checked that the file names in the darcs repo are also the same
as in the git repo? As I see it, the code that does the file name
conversion on import is utterly broken. It uses (floatPath . BC.unpack),
that is, we take the raw bytes of the input stream. Assuming that is
UTF-8 encoded, this casts the bytes of the encoded file names each to Char
and then re-encodes them (floatPath calls encodeLocale for each path
component). This simply cannot work.
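
To make the failure mode concrete, here is a minimal sketch in Python (not darcs code). It assumes a UTF-8 locale and models BC.unpack as "one byte becomes one Char with the same code point"; the helper name cast_bytes_to_chars is made up for this illustration.

```python
def cast_bytes_to_chars(raw: bytes) -> str:
    # Model of the byte-to-Char cast: every byte becomes the code point
    # with the same numeric value (this is what a Latin-1 decode does).
    return raw.decode("latin-1")


original = "müßig"
raw = original.encode("utf-8")            # bytes as they appear in the stream

broken = cast_bytes_to_chars(raw)         # 7 Chars, one per input byte
double_encoded = broken.encode("utf-8")   # re-encode, assuming a UTF-8 locale

correct = raw.decode("utf-8")             # what the import should do instead

print(len(raw), len(double_encoded))      # byte counts before and after
print(correct == original)
```

Each of the four non-ASCII bytes gets encoded a second time as a two-byte sequence, so the resulting name no longer matches the one in the git repo.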
Indeed I have tested this just now: importing from a git repo with a
file named "müßig" (German umlaut and sharp s) converts to darcs as:
ben@home:.../scratch/gtest>ll git
-rw-rw-r-- 1 ben ben 8 Jul 14 13:59 müßig
ben@home:.../scratch/gtest>ll darcs
drwxrwxr-x 6 ben ben 4096 Jul 14 14:02 _darcs
-rw-rw-r-- 1 ben ben 8 Jul 14 14:02 m303274303237ig
However, decoding plain byte sequences properly is not enough. Looking
at the output of git fast-export I see:
add file müßig
M 100644 :1 "m\303\274\303\237ig"
The first line is encoded in UTF-8. But the second line uses the quoted
file name convention. The git manual says "A path can use C-style string
quoting." and apparently git interprets that as "escape any non-ASCII
byte". So when the file name is in quoted form we also have to parse
octal escaped bytes, not only the required '\r', '\n' etc. Sigh.
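
As a rough sketch of what such a parser has to do, again in Python rather than as a proposal for the actual darcs code: strip the surrounding quotes, turn backslash escapes (one to three octal digits, or one of the usual C single-character escapes) back into raw bytes, and only then decode the result. The function name unquote_git_path and the assumption that the resulting bytes are UTF-8 are mine.

```python
import re

# Single-character C escapes mapped to their byte values.
_SIMPLE = {
    b"a": 7, b"b": 8, b"t": 9, b"n": 10, b"v": 11,
    b"f": 12, b"r": 13, b'"': 34, b"\\": 92,
}
_OCTAL = re.compile(rb"[0-7]{1,3}")


def unquote_git_path(field: bytes) -> bytes:
    """Return the raw path bytes for a possibly C-style quoted path field."""
    if not (len(field) >= 2 and field.startswith(b'"') and field.endswith(b'"')):
        return field                      # not quoted: the bytes are the path
    body = field[1:-1]
    out = bytearray()
    i = 0
    while i < len(body):
        if body[i] != 0x5C:               # anything but a backslash is literal
            out.append(body[i])
            i += 1
            continue
        i += 1                            # skip the backslash
        m = _OCTAL.match(body, i)
        if m:                             # octal escape such as \303
            value = int(m.group(), 8)
            if value > 255:
                raise ValueError("octal escape out of byte range")
            out.append(value)
            i = m.end()
        else:                             # single-character escape such as \n
            esc = body[i:i + 1]
            if esc not in _SIMPLE:
                raise ValueError("unknown or truncated escape")
            out.append(_SIMPLE[esc])
            i += 1
    return bytes(out)


# The quoted path from the fast-export line, decoded only after unquoting.
name = unquote_git_path(rb'"m\303\274\303\237ig"').decode("utf-8")
print(name)
```

The point of returning bytes is that unquoting and character decoding stay separate steps: the escapes describe bytes, and the bytes are interpreted as UTF-8 only afterwards.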
Darcs bug tracker <bugs at darcs.net>