[darcs-users] Default binary masks

Trevor Talbot quension at mac.com
Tue Nov 25 07:56:13 UTC 2003


On Monday, Nov 24, 2003, at 19:43 US/Pacific, Sean E. Russell wrote:

> On Monday 24 November 2003 17:49, Trevor Talbot wrote:

>> UTF-16 is the most balanced of the Unicode formats to deal with.
>
> Why "most balanced"?  It seems to combine the worst of all worlds:
>
> * Incompatibility with 7-bit ASCII

This is a deliberate break.  As a result, it does not have to go 
through great encoding pains -- this equates to faster and simpler 
processing.

> * Two byte orderings -- UTF-16 and UNILE; haven't we struggled enough 
> with byte ordering problems?

Conceded, but at least the order is clearly marked :)
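
To make "clearly marked" concrete, here is a rough sketch in Python of 
reading the byte order mark at the start of a stream.  The function 
name is made up for illustration; the BOM constants come from the 
standard codecs module.  It assumes the data really is UTF-16 (a 
UTF-32LE BOM begins with the same two bytes), so treat it as a sketch 
rather than a full detector.

```python
import codecs

def detect_utf16_byte_order(data: bytes) -> str:
    """Return 'little', 'big', or 'unknown' based on a leading UTF-16 BOM.

    Hypothetical helper: assumes the input is UTF-16 if a BOM is present.
    """
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return "little"
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return "big"
    return "unknown"                           # no BOM; caller must decide
```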

> * Overlapping start characters -- given a byte in the middle of a file, 
> you have to know two pieces of information to start scanning 
> characters: the byte count of the byte (so you know if it is high or 
> low) and the byte ordering (which is all the way at the start of the 
> byte stream).

You generally process a file in 16-bit units, so byte offsets never 
enter the picture; you just need to know whether to swap the bytes 
within each 16-bit unit.  As a result of the simple surrogate encoding, 
you need to scan a maximum of 1 code unit in either direction to 
retrieve the full character you landed on, or a maximum of 3 code units 
to retrieve the next or previous character.  Compare to UTF-8 scanning.
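
As an illustration of that scanning, here is a minimal sketch in 
Python.  The helper names (code_units, char_at) are invented for this 
example, and it assumes the byte order is already known.  Landing on 
any 16-bit unit, it looks at most one unit back or one unit forward to 
recover the whole character.

```python
def code_units(data: bytes, byteorder: str) -> list:
    """Split raw bytes into 16-bit units; byteorder is 'little' or 'big'."""
    return [int.from_bytes(data[i:i + 2], byteorder)
            for i in range(0, len(data) - 1, 2)]

def is_high_surrogate(u: int) -> bool:
    return 0xD800 <= u <= 0xDBFF

def is_low_surrogate(u: int) -> bool:
    return 0xDC00 <= u <= 0xDFFF

def char_at(units: list, i: int) -> tuple:
    """Return (start, end, code_point) for the character covering units[i].

    'end' is exclusive.  An unpaired surrogate is returned as-is.
    """
    # Landed on the second half of a pair: step back one unit.
    if is_low_surrogate(units[i]) and i > 0 and is_high_surrogate(units[i - 1]):
        i -= 1
    hi = units[i]
    # First half of a pair: look ahead one unit and combine.
    if is_high_surrogate(hi) and i + 1 < len(units) and is_low_surrogate(units[i + 1]):
        lo = units[i + 1]
        code_point = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
        return i, i + 2, code_point
    return i, i + 1, hi
```

For example, U+1F600 is stored as the pair D83D DE00, and char_at 
gives the same answer whichever half of the pair you land on.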

> * You get none of the advantages of a fixed-width encoding, yet 
> everybody seems to think it is a fixed width encoding.

It is an extremely limited-width encoding (one or two 16-bit units per 
character), which does help processing a great deal.  Part of that 
fixed-width misconception probably comes from Unicode's early days, 
when it was merely a 16-bit character set.

> Most of this equates to an inferiority to UTF-8 in terms of 
> interoperability, which is why UTF-8 is more common than UTF-16 in 
> transport protocols and data storage, and why XML defaults to UTF-8 
> encoding.

When interoperating with byte-oriented tools, UTF-8 usually is the best 
choice.  UTF-16 is more convenient to process, but only if you can 
break from said tools.
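
A small Python sketch of why byte-oriented tools get along with UTF-8 
and not with UTF-16: ASCII text encoded as UTF-8 is byte-for-byte 
unchanged, while the same text in UTF-16 is full of zero bytes.  The 
contains_nul check is only a hypothetical stand-in for the kind of 
"looks binary" test such a tool might apply; it is not taken from 
darcs or any other program.

```python
def contains_nul(data: bytes) -> bool:
    """Hypothetical stand-in for a byte-oriented tool's 'looks binary' test."""
    return b"\x00" in data

text = "darcs replace"                # plain ASCII sample text
utf8 = text.encode("utf-8")           # identical to the ASCII bytes
utf16 = text.encode("utf-16-le")      # every ASCII char gains a zero byte

print(contains_nul(utf8), contains_nul(utf16))
```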

> I think this is starting to stray off-topic, though.

Perhaps this discussion will be of some use if Unicode processing 
becomes an issue for darcs (such as for "replace").




