[darcs-users] Fwd: UTF8 patch
Eric Kow
kowey at darcs.net
Mon Dec 1 08:57:34 UTC 2008
On Sun, Nov 30, 2008 at 16:34:43 -0500, Gwern Branwen wrote:
> But as I said before, so far as I know,
> we shouldn't have to care about possibly erroneous strings because the
> only function in UTF8.lhs is 'encode :: String -> [Word8]', and the
> Haskell runtime will never give us a malformed String. (If we do get a
> bad string, certainly nothing encode can do will make the situation
> worse.)
So, I was just about to apply this last night before getting attacked by
another fit of paranoia. I'm going to need just a little bit more
patient coaxing here:
1. Looking at Wikipedia, obviously the most reliable and stable place to
learn things, I see that UTF-8 has a payload of up to 21 bits.
2. Haskell Char values are 32 bits wide
3. I presume that by 'malformed' String, you mean "things which cannot
be represented in 21 bits" or "not valid Unicode". I'm sorry if my
language is sloppy here, but I hope what I'm saying makes sense.
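For reference, the gap between points 1 and 2 can be checked directly. This is only a small sketch, assuming GHC, where maxBound :: Char is '\x10FFFF'; the names maxCharCode and fitsIn21Bits are made up for illustration and are not part of UTF8.lhs or utf8-string.

```haskell
import Data.Char (ord)

-- Largest code point a Char can hold (0x10FFFF under GHC).
maxCharCode :: Int
maxCharCode = ord (maxBound :: Char)

-- Hypothetical helper: does a value fit in the 21-bit payload of a
-- four-byte UTF-8 sequence?
fitsIn21Bits :: Int -> Bool
fitsIn21Bits n = n >= 0 && n < 2 ^ (21 :: Int)
```

Under that assumption, fitsIn21Bits maxCharCode holds, so a Char may occupy 32 bits of storage while its value still stays within 21 bits.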
So... how do we know that we will never get such a Char? Ian assures us
that "You can only put valid unicode values in a Char". But to what
extent is that true? For example, can we do something nasty and low
level to inadvertently produce those chars? Is there a scenario where
this kind of thing may happen? Let's say we're reading in filenames.
What if for some reason or another, the filenames use some crazy
encoding with lots of non-Unicode things?
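One way to make the question concrete: Data.Char.chr refuses values outside the Char range, so an out-of-range Char would have to come from something lower level that skips that check. The sketch below is illustrative only; safeChr and isSurrogate are hypothetical helpers, not functions from darcs or utf8-string. It also flags a related wrinkle: surrogate code points (U+D800 to U+DFFF) are accepted as Char values by GHC even though they are not valid in well-formed UTF-8.

```haskell
import Data.Char (chr, ord)

-- Hypothetical helper: a total version of chr that returns Nothing
-- instead of raising an error for out-of-range values.
safeChr :: Int -> Maybe Char
safeChr n
  | n >= 0 && n <= ord maxBound = Just (chr n)
  | otherwise                   = Nothing

-- Hypothetical helper: surrogates are representable as Char in GHC,
-- but they are not Unicode scalar values.
isSurrogate :: Char -> Bool
isSurrogate c = c >= '\xD800' && c <= '\xDFFF'
```

So "fits in a Char" and "valid Unicode" are slightly different questions, and a filename decoded from some odd encoding could plausibly land in the second category without ever leaving the first.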
That said, looking at
http://hackage.haskell.org/packages/archive/utf8-string/0.3.3/doc/html/src/Codec-Binary-UTF8-String.html#encode
vs.
http://darcs.net/api-doc/src-UTF8.html
it does seem that UTF8's encode function is more stringent than
utf8-string's. So it seems that the consequence of having a
crazy character is that we would have gotten an error along the
lines of
    encodeUTF8: ord returned a value above 0x10FFFF
anyway... and since that hasn't happened in the past to my knowledge,
we can probably relax (although we should think about either keeping
that check on our side or getting it introduced into the utf8-string
package).
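To illustrate what a "stringent" encoder looks like, here is a minimal sketch of a range-checked UTF-8 encoder. It is not the code from UTF8.lhs or from utf8-string; the names encodeChar and encodeChecked are made up, and it reports the problem as a Left value rather than calling error. Under GHC the final guard should be unreachable, since ord never exceeds 0x10FFFF.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Hypothetical helper: encode one Char, rejecting out-of-range values.
encodeChar :: Char -> Either String [Word8]
encodeChar c
  | n < 0x80      = Right [fromIntegral n]
  | n < 0x800     = Right [0xC0 .|. lead 6, cont 0]
  | n < 0x10000   = Right [0xE0 .|. lead 12, cont 6, cont 0]
  | n <= 0x10FFFF = Right [0xF0 .|. lead 18, cont 12, cont 6, cont 0]
  | otherwise     = Left "encodeUTF8: ord returned a value above 0x10FFFF"
  where
    n = ord c

    -- High bits that go into the leading byte.
    lead :: Int -> Word8
    lead s = fromIntegral (n `shiftR` s)

    -- Six payload bits wrapped in a 10xxxxxx continuation byte.
    cont :: Int -> Word8
    cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)

-- Hypothetical helper: encode a whole String, stopping at the first error.
encodeChecked :: String -> Either String [Word8]
encodeChecked = fmap concat . mapM encodeChar
```

Returning Either rather than raising an error is just one design choice for the sketch; either way, the point is that the check costs one extra guard.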
Anyway, that's the current state of my flip-flop. I'll probably
apply it tomorrow unless somebody shouts.
--
Eric Kow <http://www.nltg.brighton.ac.uk/home/Eric.Kow>
PGP Key ID: 08AC04F9