[darcs-devel] Fixing the encoding mess

Ben Franksen ben.franksen at online.de
Tue Mar 15 23:12:18 UTC 2016


This has been prompted by issue2480 and is an attempt at clarifying how we 
came to be in the sorry state we are in and what to do about it. Correct me 
if I've got it wrong or if it's overly simplified.

The origin of Darcs' encoding problems is that in ancient times, GHC and the 
base libraries were completely buggy w.r.t. file IO. Things like ByteString 
did not yet exist, so they used String; but they turned raw bytes into 
Haskell's Char (and vice versa) one-to-one without any de- or encoding.  
This "worked" for single byte encodings such as ASCII or ISO Latin and was 
utterly broken for any kind of multibyte encodings such as defined by 
Unicode.

When Unicode and multibyte encodings became so prevalent that the issue
could no longer be ignored, GHC and the base libraries were fixed. For
many years now, they have taken the user's locale into account and
properly encode and decode Chars when doing file or stream IO (and when
handling command line arguments, too).

Now, this change broke Darcs (shortly before darcs-2.5), which had been
relying on the previous buggy behavior in a great many places. The
solution was to add a hack that perpetuates the old buggy behavior so
that Darcs could go on working as before. See issue2095, especially
Message15687.

In order to fix the Unicode problems in Darcs at the root, we will have
to first decide and then specify how we want Darcs to behave in the
future, without regard to compatibility or to what happened in the past.
For patch metadata we already have a good story: it is always stored in
UTF-8 encoding. This makes sense, given that the metadata is there for
human consumption; we just have to make sure it gets re-coded properly
according to the user's locale, which is standard stuff by now.
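
To make that concrete, here is a minimal sketch (not the actual Darcs
code; the function names are invented) of how stored UTF-8 metadata
could be re-coded for display, relying on the text package and on the
locale-aware encoding GHC already applies to stdout:

  import qualified Data.ByteString as B
  import qualified Data.Text as T
  import qualified Data.Text.Encoding as TE
  import qualified Data.Text.Encoding.Error as TEE
  import qualified Data.Text.IO as TIO

  -- Interpret stored metadata bytes as UTF-8, replacing invalid
  -- sequences instead of failing.
  decodeMetadata :: B.ByteString -> T.Text
  decodeMetadata = TE.decodeUtf8With TEE.lenientDecode

  -- Printing via Data.Text.IO encodes using the handle's encoding,
  -- which GHC derives from the user's locale by default.
  displayMetadata :: B.ByteString -> IO ()
  displayMetadata = TIO.putStrLn . decodeMetadata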

For patch content, there are two sane possibilities I can see:

One is to record an encoding together with the data on a file-by-file basis. 
We would need a new form of primitive patch that sets or changes the 
encoding of a file or directory. This would enable people to work on a 
project without having to agree on the same content encoding. Adding this 
new patch type and storing it in a way that is compatible with the current 
patch format should be (just) doable; we have to specify the commutation 
axioms and prove that they preserve the laws we expect to hold (assuming 
that they hold now, which we know isn't true, but that's another story), 
which doesn't look too difficult to me.
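
For illustration, here is a hypothetical sketch of what such a
primitive patch and one of its commutation rules might look like; the
constructor names and types are made up and deliberately ignore Darcs'
actual witness-typed Prim and commute machinery:

  import qualified Data.ByteString as B

  type EncodingName = B.ByteString   -- e.g. "UTF-8", "ISO-8859-1"

  data Prim
    = AddFile FilePath
    | Hunk FilePath Int [B.ByteString] [B.ByteString]
    | ChangeEncoding FilePath EncodingName EncodingName
        -- record that a file's content encoding changes from old to new
    -- ... the other existing prim constructors elided

  -- An encoding change and a hunk on different files commute
  -- trivially; on the same file they must not, because the hunk's raw
  -- bytes depend on the encoding in force.
  commutePrim :: (Prim, Prim) -> Maybe (Prim, Prim)
  commutePrim (ChangeEncoding p old new, Hunk q l olds news)
    | p /= q    = Just (Hunk q l olds news, ChangeEncoding p old new)
    | otherwise = Nothing
  commutePrim _ = Nothing  -- all other cases elided here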

The other possibility is to always use the user's locale to convert
from/to the raw bytes managed by Darcs. Perhaps simpler to implement,
but content that was added by one user might look funny to another user
if their locales differ. (This would still be better than what we have
today, I guess, which is to display <U+xxxx>s for non-ASCII byte
sequences.)
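
A rough sketch of that second option, assuming we do it with GHC's
notion of the current locale encoding (error handling for bytes that
are invalid in that encoding is ignored here and would need a real
strategy):

  import qualified Data.ByteString as B
  import GHC.IO.Encoding (getLocaleEncoding)
  import GHC.Foreign (peekCStringLen)

  -- Decode raw content bytes, as stored in a patch, using the user's
  -- current locale encoding.
  decodeWithLocale :: B.ByteString -> IO String
  decodeWithLocale bs = do
    enc <- getLocaleEncoding
    B.useAsCStringLen bs (peekCStringLen enc)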

However we decide, the first step in realizing a proper solution would be to 
revert the hack that perpetuates the buggy behavior of the base libraries. 
This will certainly cause lots of breakage at first. We then need a number 
of good test scripts to specify the desired behavior, as well as to guard 
against regressions with old repositories. It should be clear that one goal 
must be to limit de-/encoding to the very outermost UI layer.

The concrete problem that prompted the hack (issue2095) had to do with the 
content of directories. A possible way forward might be to cut down on the 
number of data types wrapping String/FilePath and instead push the 
internally used AnchoredPath type (which represents paths as lists of 
ByteStrings and thus is completely agnostic with regard to encoding) 
outward as far as possible (that is, until conversion to FilePath is forced 
by some external API).
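
A simplified sketch of that idea (not the real Darcs.Util.Path, and the
UTF-8 assumption at the boundary is just for illustration): path
components stay raw ByteStrings internally, and an interpretation of
the bytes is chosen only where an external API forces a FilePath on us:

  import qualified Data.ByteString as B
  import qualified Data.Text as T
  import qualified Data.Text.Encoding as TE
  import qualified Data.Text.Encoding.Error as TEE
  import Data.List (foldl')
  import System.FilePath ((</>))

  newtype Name = Name B.ByteString           -- one path component, raw bytes
  newtype AnchoredPath = AnchoredPath [Name] -- repo-relative, root-anchored

  -- Only at the boundary to an external API do we commit to an
  -- interpretation of the bytes (here: assume UTF-8, lenient on errors).
  anchoredToFilePath :: AnchoredPath -> FilePath
  anchoredToFilePath (AnchoredPath names) =
    foldl' (</>) "." [ T.unpack (TE.decodeUtf8With TEE.lenientDecode n)
                     | Name n <- names ]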

So much for today...

Cheers
Ben
-- 
"Make it so they have to reboot after every typo." ― Scott Adams




More information about the darcs-devel mailing list