[darcs-users] Default binary masks
quension at mac.com
Tue Nov 25 07:56:13 UTC 2003
On Monday, Nov 24, 2003, at 19:43 US/Pacific, Sean E. Russell wrote:
> On Monday 24 November 2003 17:49, Trevor Talbot wrote:
>> UTF-16 is the most balanced of the Unicode formats to deal with.
> Why "most balanced"? It seems to combine the worst of all worlds:
> * Incompatibility with 7-bit ASCII
This is a deliberate break. As a result, it does not have to go
through great encoding pains -- this equates to faster and simpler
processing.
> * Two byte orderings -- UTF-16 and UNILE; haven't we struggled enough
> with byte ordering problems?
Conceded, but at least the order is clearly marked :)
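
To illustrate the "clearly marked" part: the byte order mark (BOM) at the
start of a UTF-16 stream can be checked in a few lines. This is only a
minimal sketch in Python. It assumes the data is already known to be
UTF-16, and the helper name detect_utf16_order is made up for this example.

```python
import codecs

def detect_utf16_order(data: bytes) -> str:
    """Return 'little', 'big', or 'unknown' from a leading UTF-16 BOM.

    Hypothetical helper for illustration; assumes `data` is UTF-16.
    """
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return "little"
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return "big"
    return "unknown"  # no BOM: the order has to come from somewhere else
```

When there is no BOM, the sketch deliberately reports "unknown" rather than
guessing; the byte order then has to be agreed on out of band.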
> * Overlapping start characters -- given a byte in the middle of a file,
> you have to know two pieces of information to start scanning
> characters: the byte count of the byte (so you know if it is high or
> low) and the byte ordering (which is all the way at the start of the
> byte stream).
You generally process a file in 16-bit units, so byte offsets never
enter the picture; you just need to know whether to swap the bytes
within each 16-bit unit. As a result of the simple surrogate encoding,
you need to scan a maximum of 1 code unit in either direction to
retrieve the full character you landed on, or a maximum of 3 code units
to retrieve the next or previous characters. Compare to UTF-8 scanning.
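
The resynchronization described above can be sketched roughly as follows.
This is an illustrative Python example, not code from any particular
library: the function name char_at is hypothetical, the input is assumed to
be a sequence of 16-bit units already in native byte order, and the data is
assumed to be well-formed (every surrogate properly paired).

```python
def char_at(units, i):
    """Return (start_index, code_point) of the character covering units[i].

    Hypothetical helper; assumes well-formed UTF-16 in native byte order.
    """
    u = units[i]
    if 0xDC00 <= u <= 0xDFFF:
        # Landed on a low (trailing) surrogate: step back exactly one unit.
        i -= 1
        u = units[i]
    if 0xD800 <= u <= 0xDBFF:
        # High (leading) surrogate: combine it with the unit that follows.
        low = units[i + 1]
        return i, 0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00)
    # Any other value is a complete character in a single unit.
    return i, u
```

For example, with units = [0x0041, 0xD83D, 0xDE00, 0x0042], both
char_at(units, 1) and char_at(units, 2) resolve to the same character
starting at index 1. The equivalent UTF-8 routine may have to step back
over as many as three continuation bytes before it finds a lead byte.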
> * You get none of the advantages of a fixed-width encoding, yet
> everybody seems to think it is a fixed width encoding.
It is an extremely limited-width encoding (at most 2 units per
character), which does help processing a great deal. Part of that
fixed-width misconception
probably comes from Unicode's early days, when it was merely a 16bit
> Most of this equates to an inferiority to UTF-8 in terms of
> interoperability, which is why UTF-8 is more common than UTF-16 in
> transport protocols and data storage, and why XML defaults to UTF-8
When interoperating with byte-oriented tools, UTF-8 usually is the best
choice. UTF-16 is more convenient to process, but only if you can
break from said tools.
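
A small Python sketch of why byte-oriented tools get along with UTF-8 but
not with UTF-16. The sample string is arbitrary, and the variable names are
chosen only for this example.

```python
text = "replace"

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")   # little-endian, no BOM

# ASCII text is byte-for-byte unchanged in UTF-8 ...
ascii_compatible = utf8 == text.encode("ascii")

# ... while UTF-16 interleaves zero bytes, which tools built around
# NUL-terminated strings treat as the end of the data.
has_nul_bytes = b"\x00" in utf16
```

Both flags come out true for ASCII-only text, and the UTF-16 form is
exactly twice as long as the UTF-8 one.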
> I think this is starting to stray off-topic, though.
Perhaps this discussion will be of some use if Unicode processing
becomes an issue for darcs (such as for "replace").