[darcs-devel] DARCS for Windows international development

Samuel A. Falvo II sam.falvo at gmail.com
Thu May 31 15:52:50 PDT 2007


On 5/31/07, Paul Schauble <Paul.Schauble at ticketmaster.com> wrote:
> working on a project in Simplified Chinese. Most Chinese characters are
> 3 bytes in UTF-8, some are 4. In the so-called "modified UTF-8" that
> Java uses, some are 6.

True, but most of the languages on this planet use the Latin character
set, _most_ of which seems to be covered by plain ASCII (printable
codes 32-126), where UTF-8 needs only one byte per character.  Hence, a
larger group of languages gets a more compact representation.
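
To make the byte counts above concrete, here is a rough Python sketch.
The helper names utf8_len and modified_utf8_len are made up for
illustration; the second one only approximates Java's "modified UTF-8"
by encoding each UTF-16 surrogate separately and giving U+0000 two
bytes, which is my understanding of that format rather than a reference
implementation.

```python
def utf8_len(ch: str) -> int:
    """Number of bytes standard UTF-8 needs for a single character."""
    return len(ch.encode("utf-8"))


def modified_utf8_len(ch: str) -> int:
    """Approximate byte count under Java-style "modified UTF-8".

    Assumption: supplementary characters are split into a UTF-16
    surrogate pair and each surrogate is encoded on its own (3 bytes
    each), and U+0000 is stored as two bytes.
    """
    if ch == "\x00":
        return 2
    units = ch.encode("utf-16-be")  # 2 bytes per UTF-16 code unit
    total = 0
    for i in range(0, len(units), 2):
        unit = chr(int.from_bytes(units[i:i + 2], "big"))
        # "surrogatepass" lets a lone surrogate be encoded as 3 bytes.
        total += len(unit.encode("utf-8", "surrogatepass"))
    return total


if __name__ == "__main__":
    for ch in ("A", "\u4e2d", "\U00020000"):
        print(f"U+{ord(ch):04X}", utf8_len(ch), modified_utf8_len(ch))
```

The three sample characters are a Latin letter, a common Chinese
character, and a supplementary-plane CJK character.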

Also, "most" programs use UTF-8.  Some don't; IIRC, the "sam"
editor uses raw UTF-16 for its output files, while GCC on Linux
compiles code such that wide characters (wchar_t) are 32 bits wide.
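
For comparison, a small Python sketch of how the same text grows or
shrinks under the three encoding forms mentioned here.  The function
name encoded_sizes is hypothetical, and the "-le" codec variants are
chosen only so that no BOM is counted in the sizes.

```python
def encoded_sizes(text: str) -> dict:
    """Byte length of text in UTF-8, UTF-16 and UTF-32 (no BOM)."""
    return {
        name: len(text.encode(name))
        for name in ("utf-8", "utf-16-le", "utf-32-le")
    }


if __name__ == "__main__":
    for sample in ("hello", "\u4e2d\u6587"):
        print(repr(sample), encoded_sizes(sample))
```

Latin text is smallest in UTF-8, while text made of 3-byte characters
is smaller in UTF-16.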

> Absent a BOM, is there another convention on Linux that allows you to
> identify a UTF-8 file? Or does the program just have to know in advance
> that it's reading UTF-8?

At the end of the day, you just have to know.  However, some
UTF-encoded files will have a ".utf" extension instead of ".txt".
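
Short of knowing in advance, one common heuristic is to check for a
BOM and otherwise see whether the bytes decode cleanly as UTF-8.  The
sketch below is only that, a heuristic: looks_like_utf8 is a name I
made up, pure ASCII input passes trivially, and a file in some other
encoding can occasionally pass by accident.

```python
import codecs


def looks_like_utf8(data: bytes) -> bool:
    """Guess whether data is UTF-8: BOM present, or strict decode works."""
    if data.startswith(codecs.BOM_UTF8):
        return True
    try:
        data.decode("utf-8")  # strict mode raises on invalid sequences
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    print(looks_like_utf8("\u4e2d\u6587".encode("utf-8")))
    print(looks_like_utf8("caf\u00e9".encode("latin-1")))
```

Text in a legacy 8-bit encoding usually fails the strict decode,
because its high bytes rarely form valid UTF-8 sequences.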

-- 
Samuel A. Falvo II

