[darcs-devel] unicode handling

Fri Apr 13 06:55:35 UTC 2012

Hi,

After quite a bit of digging, I discovered there was a simple workaround
for the GHC and unicode filenames problem that's been blocking correct
behaviour on GHC 7.4 on Unix. (http://bugs.darcs.net/patch777)

However, this still has some problems:
 - Filenames with Unicode codepoints >= 256 were previously broken on
   Windows, and they still are
 - It involves setting GHC-wide global variables, which may be a problem
   for library users

I'm still not really sure what the "right" solution to this problem is.

Here follows a general braindump, since I will likely forget about all
this now that the immediate problem is sorted :-)

The hashed-storage index relies on ByteStrings that point straight into
the mmapped on-disk file (built via
Data.ByteString.Internal.fromForeignPtr) and that are then passed to
stat via Data.ByteString.Unsafe.unsafeUseAsCString. This code path is
supposed to be fast; it supports quick darcs record/whatsnew, because
the index maps last modified time to a content hash, allowing any file
which hasn't changed since the last index update to be compared with
pristine without actually reading the file.
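
To make the index idea concrete, here is a rough sketch of the lookup
logic only. The names (IndexEntry, checkEntry, Status) are made up for
illustration and are not hashed-storage's actual types; the real index
is an mmapped binary file, not a Data.Map.

```haskell
import qualified Data.ByteString.Char8 as BC
import qualified Data.Map.Strict as M
import Data.Int (Int64)

-- Hypothetical index entry: last modified time plus content hash.
data IndexEntry = IndexEntry
  { entryMTime :: Int64
  , entryHash  :: BC.ByteString
  } deriving (Eq, Show)

-- Hypothetical in-memory stand-in for the on-disk index,
-- keyed by the raw path bytes.
type Index = M.Map BC.ByteString IndexEntry

data Status
  = Unchanged BC.ByteString  -- mtime matches: reuse the stored hash
  | NeedsRehash              -- unknown path or mtime differs: read the file
  deriving (Eq, Show)

-- Given the mtime reported by stat, decide whether the file's
-- contents have to be read at all.
checkEntry :: Index -> BC.ByteString -> Int64 -> Status
checkEntry idx path mtime =
  case M.lookup path idx of
    Just e | entryMTime e == mtime -> Unchanged (entryHash e)
    _                              -> NeedsRehash
```

Only the NeedsRehash case costs a file read, which is why the stat call
itself needs to be cheap.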

Independently of any language implementation/library, using ByteString
or an equivalent representation is the right thing to do on POSIX
systems where the raw OS API indeed uses just a string of bytes.

However, on Windows the raw OS APIs give UTF-16. Currently
hashed-storage truncates each code unit to 8 bits on read to stuff it
into a ByteString, hence the above-mentioned bug with codepoints >= 256.
One alternative would be to explode the UTF-16 into a ByteString using
two bytes per code unit, although I'm not sure if there's a good API for
doing that translation fast (though at the moment we go via String even
on the Unix side, so it can't really get worse).
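
For illustration, a minimal sketch of the two-bytes-per-code-unit idea
next to the current truncating behaviour. All function names here are
hypothetical and nothing below is existing hashed-storage code; it goes
via String and makes no attempt to be fast.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as BL
import Data.ByteString.Builder (toLazyByteString, word16LE)
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word16)

-- Split one character into its UTF-16 code units
-- (a surrogate pair for codepoints above 0xFFFF).
toUnits :: Char -> [Word16]
toUnits c
  | n < 0x10000 = [fromIntegral n]
  | otherwise   = [ fromIntegral (0xD800 + (m `shiftR` 10))
                  , fromIntegral (0xDC00 + (m .&. 0x3FF)) ]
  where
    n = ord c
    m = n - 0x10000

-- Two bytes per code unit, little-endian.
encodeUtf16LE :: String -> B.ByteString
encodeUtf16LE =
  BL.toStrict . toLazyByteString . foldMap word16LE . concatMap toUnits

-- Inverse of encodeUtf16LE; a trailing odd byte is silently dropped.
decodeUtf16LE :: B.ByteString -> String
decodeUtf16LE = go . units . B.unpack
  where
    units (lo : hi : rest) =
      (fromIntegral lo .|. (fromIntegral hi `shiftL` 8) :: Int) : units rest
    units _ = []
    go (h : l : rest)
      | h >= 0xD800, h < 0xDC00, l >= 0xDC00, l < 0xE000 =
          chr (0x10000 + ((h - 0xD800) `shiftL` 10) + (l - 0xDC00)) : go rest
    go (u : rest) = chr u : go rest
    go [] = []

-- The lossy behaviour described above: keep only the low 8 bits.
truncate8 :: String -> B.ByteString
truncate8 = B.pack . map (fromIntegral . ord)
```

With this, a name containing U+03BB survives a round trip through
encodeUtf16LE/decodeUtf16LE, whereas truncate8 collapses it to the
single byte 0xBB.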

Right now hashed-storage mostly gets paths from the OS as String,
converts to ByteString and then converts back to String when calling the
OS again. I think darcs itself is similar, although patches themselves
store paths as String. The Printer code stores things as either String
or ByteString, or in a few cases both at once (I think this is just for
constant strings like punctuation).

This is all a bit of a mess :-) Pushing some newtypes into the code
seems like a natural first step to clarifying representations and intent.
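
As a sketch of what I mean (type and function names are invented here,
not taken from darcs or hashed-storage), something along these lines
would let the type checker keep "bytes for the OS" and "text for the
user" apart:

```haskell
import qualified Data.ByteString as B
import Data.Char (chr)
import Numeric (showHex)

-- Bytes exactly as a POSIX API handed them over; no encoding assumed.
newtype RawPath = RawPath B.ByteString
  deriving (Eq, Ord, Show)

-- Text meant only for showing to a user, never for passing back
-- to the OS.
newtype DisplayPath = DisplayPath String
  deriving (Eq, Show)

-- Readable rendering: printable ASCII is kept as-is, every other
-- byte is shown as \xNN.
toDisplay :: RawPath -> DisplayPath
toDisplay (RawPath bs) = DisplayPath (concatMap render (B.unpack bs))
  where
    render w
      | w >= 0x20 && w < 0x7F = [chr (fromIntegral w)]
      | otherwise             = "\\x" ++ pad (showHex w "")
    pad s = replicate (2 - length s) '0' ++ s
```

There is deliberately no conversion from DisplayPath back to RawPath, so
a path that has been prettified for output cannot accidentally end up in
a stat call.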

Ganesh