[darcs-devel] DARCS for Windows international development
Samuel A. Falvo II
sam.falvo at gmail.com
Thu May 31 15:52:50 PDT 2007
On 5/31/07, Paul Schauble <Paul.Schauble at ticketmaster.com> wrote:
> working on a project in Simplified Chinese. Most Chinese characters are
> 3 bytes in UTF-8, some are 4. In the so-called "modified UTF-8" that
> Java uses, some are 6.
True, but most of the languages on this planet use the Latin character
set, _most_ of which is covered by plain ASCII (codes 32-126), and
UTF-8 stores each of those codes in a single byte. Hence, a larger
group of languages gets a more compact representation.
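
To make the byte counts concrete, here is a small Python sketch (not
part of the original exchange). The helper names utf8_len and
modified_utf8_len are made up for illustration, and the modified UTF-8
calculation assumes the usual description of Java's format: NUL is
written as two bytes, and characters outside the BMP are written as a
surrogate pair with each half encoded separately.

```python
def utf8_len(ch: str) -> int:
    """Number of bytes standard UTF-8 uses for one character."""
    return len(ch.encode("utf-8"))


def modified_utf8_len(ch: str) -> int:
    """Bytes used for one character in Java-style "modified UTF-8" (assumed rules)."""
    if ch == "\x00":
        return 2  # modified UTF-8 writes NUL as the two bytes C0 80
    raw = ch.encode("utf-16-be")
    total = 0
    # Walk the UTF-16 code units; each unit (including each surrogate
    # half) is encoded on its own, so a surrogate pair costs 3 + 3 bytes.
    for i in range(0, len(raw), 2):
        unit = int.from_bytes(raw[i:i + 2], "big")
        total += len(chr(unit).encode("utf-8", "surrogatepass"))
    return total


# ASCII letter, accented Latin letter, common CJK ideograph,
# and a CJK Extension B ideograph outside the BMP.
for ch in ("A", "\u00e9", "\u4e2d", "\U00020000"):
    print(f"U+{ord(ch):04X}: UTF-8 {utf8_len(ch)} byte(s), "
          f"modified UTF-8 {modified_utf8_len(ch)} byte(s)")
```

The ASCII letter takes one byte, the accented letter two, the common
ideograph three, and the Extension B ideograph four in UTF-8 but six in
the modified form, which matches the figures Paul quotes.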
Also, "most" programs use UTF-8. Some don't -- IIRC, the "sam"
editor uses raw UTF-16 for its output files, while GCC on Linux
compiles code such that wide characters (wchar_t) are 32 bits wide.
> Absent a BOM, is there another convention on Linux that allows you to
> identify a UTF-8 file? Or does the program just have to know in advance
> that it's reading UTF-8?
At the end of the day, you just have to know. However, some
UTF-encoded files will have a ".utf" extension instead of ".txt".
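
Short of knowing in advance, a program can at least guess. The Python
sketch below (not part of the original exchange) shows one common
heuristic: look for a BOM, then check whether the bytes decode as
strict UTF-8. The function name sniff_utf8 and its return labels are
hypothetical, and a successful decode is only evidence, not proof,
since some files in other encodings also happen to be valid UTF-8.

```python
import codecs


def sniff_utf8(data: bytes) -> str:
    """Guess whether a byte string is UTF-8. Heuristic only."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-bom"
    try:
        data.decode("utf-8")  # strict decoding rejects malformed sequences
    except UnicodeDecodeError:
        return "not-utf-8"
    if all(b < 0x80 for b in data):
        return "ascii"  # pure ASCII is valid UTF-8 but proves nothing
    return "probably-utf-8"


print(sniff_utf8(b"hello"))
print(sniff_utf8("\u4e2d\u6587".encode("utf-8")))
print(sniff_utf8(b"\xef\xbb\xbfhi"))
print(sniff_utf8("\u00e9".encode("latin-1")))
```

The check works reasonably well in practice because multi-byte UTF-8
sequences have a strict structure that text in legacy 8-bit encodings
rarely satisfies by accident.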
--
Samuel A. Falvo II