[darcs-users] Fwd: UTF8 patch

Thu Dec 4 14:04:04 UTC 2008

2008/12/1 Eric Kow <kowey at darcs.net>:
> On Sun, Nov 30, 2008 at 16:34:43 -0500, Gwern Branwen wrote:
>> But as I said before, so far as I know,
>> we shouldn't have to care about possibly erroneous strings because the
>> only function in UTF8.lhs is 'encode :: String -> [Word8]', and the
>> Haskell runtime will never give us a malformed String. (If we do get a
>> bad string, certainly nothing encode can do will make the situation
>> worse.)
>
> So, I was just about to apply this last night before getting attacked by
> another fit of paranoia.  I'm going to need just a little bit more
> patient coaxing here:
>
> 1. Looking at Wikipedia, obviously the most reliable and stable place to
>   learn things, I see that UTF-8 has a payload of up to 21 bits.
> 2. Haskell Char are 32 bit
> 3. I presume that by 'malformed' String, you mean "things which cannot
>   be represented in 21 bits" or "not valid Unicode".  I'm sorry if my
>   language is sloppy here, but I hope what I'm saying makes sense
>
> So... how do we know that we will never get such Char?  Ian assures us
> that "You can only put valid unicode values in a Char".  But to what
> extent is that true?  For example, can we do something nasty and low
> level to inadvertently produce those chars?  Is there a scenario where
> this kind of thing may happen?  Let's say we're reading in filenames.
> What if for some reason or another, the filenames use some crazy
> encoding with lots of non-Unicode things?

(a few days late...)

From http://darcs.haskell.org/libraries/base/GHC/Base.lhs :

chr :: Int -> Char
chr (I# i#) | int2Word# i# `leWord#` int2Word# 0x10FFFF# = C# (chr# i#)
            | otherwise                                  = error "Prelude.chr: bad argument"

i.e. unless you bypass chr with the unboxed primitives (the C#
constructor together with the chr# primop from GHC.Prim) then you're
pretty safe. I'm not sure what the Haskell standard says about the
implementation size of Char (does it _have_ to be 32 bits?) but it is,
AFAIR, specified to cover only the current Unicode codepoint range
0-0x10FFFF.
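
A quick way to check the range claim is to ask for the upper bound of
Char and to wrap chr in a total function. This is only a sketch: safeChr
and maxCodePoint are hypothetical names made up for illustration; they
are not part of base or darcs.

```haskell
import Data.Char (chr, ord)

-- The largest code point a Char can hold, as an Int.
-- (Hypothetical name, for illustration only.)
maxCodePoint :: Int
maxCodePoint = ord (maxBound :: Char)

-- Hypothetical helper: a total variant of chr that returns Nothing
-- for out-of-range input instead of calling error.
safeChr :: Int -> Maybe Char
safeChr n
  | n >= 0 && n <= maxCodePoint = Just (chr n)
  | otherwise                   = Nothing
```

With this, maxCodePoint should equal 0x10FFFF, and safeChr 0x110000
gives Nothing where plain chr would raise "Prelude.chr: bad argument".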

Yes, you can probably do nasty things to create UTF8 sequences that
contain codepoints > 0x10FFFF (I'm just guessing here, but poking
bytes into the data part of a ByteString might do it), but in normal
operations, including reading filenames, Haskell Strings should be
quite safe. The filename scenario: the filename would have to be
marshaled first into a String. At this point, it should be sanitised.
If it has an exotic encoding, then whatever is doing the marshaling
must deal with it.
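
To see why a Char bounded by 0x10FFFF always fits the 21-bit UTF-8
payload, here is a minimal sketch of an encoder with the same type as
the function under discussion. It is NOT the code from UTF8.lhs;
encodeChar and the local helpers are names invented for illustration,
and it makes no attempt to reject surrogate code points (0xD800-0xDFFF),
which as far as I know a GHC Char can also hold.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Illustrative only: not the implementation in darcs' UTF8.lhs.
-- Encodes one Char as 1 to 4 bytes, depending on the code point.
encodeChar :: Char -> [Word8]
encodeChar c
  | n < 0x80    = [fromIntegral n]                           -- 7 bits
  | n < 0x800   = [lead 0xC0 6, cont 0]                      -- 11 bits
  | n < 0x10000 = [lead 0xE0 12, cont 6, cont 0]             -- 16 bits
  | otherwise   = [lead 0xF0 18, cont 12, cont 6, cont 0]    -- 21 bits
  where
    n = ord c
    -- Leading byte: length marker OR'd with the top payload bits.
    lead marker s = fromIntegral (marker .|. (n `shiftR` s))
    -- Continuation byte: 10xxxxxx carrying six payload bits.
    cont s = fromIntegral (0x80 .|. ((n `shiftR` s) .&. 0x3F))

-- Same type as the encode function mentioned earlier in the thread.
encode :: String -> [Word8]
encode = concatMap encodeChar
```

Since ord c can never exceed 0x10FFFF, the last branch shifts out at
most three bits for the leading byte, so the four-byte form is always
enough and no Char built through chr can overflow it.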

Alistair