[darcs-devel] DARCS for Windows international development
Stephen J. Turnbull
stephen at xemacs.org
Fri Jun 1 00:03:20 PDT 2007
Stefan O'Rear writes:
> On Thu, May 31, 2007 at 03:32:04PM -0700, Paul Schauble wrote:
> > Absent a BOM, is there another convention on Linux that allows you to
> > identify a UTF-8 file? Or does the program just have to know in advance
> > that it's reading UTF-8?
If *all* 8-bit characters come in groups, with the leading byte of the
form 11bbbbbb and the later ones of the form 10bbbbbb, you're probably
looking at UTF-8. (You can actually be more precise; count the number of
leading 1s in the first byte, say N, and it will be followed by
exactly N-1 10bbbbbb-form bytes. The next byte will be either
0bbbbbbb or 11bbbbbb.)
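
The byte-pattern rule above can be sketched in a few lines of Python. This is only a rough illustration, not a full validator: the function name `looks_like_utf8` is made up for this example, the cap of four bytes per sequence comes from the current UTF-8 definition (RFC 3629) rather than from anything stated above, and overlong forms and surrogates are not checked.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Heuristic: every high-bit byte must belong to a well-formed group."""
    i = 0
    n = len(data)
    while i < n:
        byte = data[i]
        if byte < 0x80:
            # 0bbbbbbb: plain ASCII, nothing to check.
            i += 1
            continue
        # Count the leading 1 bits of the lead byte (this is N).
        ones = 0
        mask = 0x80
        while mask and (byte & mask):
            ones += 1
            mask >>= 1
        # N == 1 is a stray 10bbbbbb byte; N > 4 is assumed invalid (RFC 3629).
        if ones < 2 or ones > 4:
            return False
        # The lead byte must be followed by exactly N-1 bytes of form 10bbbbbb.
        tail = data[i + 1:i + ones]
        if len(tail) != ones - 1:
            return False
        if any((b & 0xC0) != 0x80 for b in tail):
            return False
        i += ones
    return True
```

Note that pure ASCII passes trivially, so a True result is only evidence of UTF-8 when the data actually contains 8-bit bytes.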
> On Linux, all files are encoded in the character set specified by the
> locale environment variables; for instance LANG=en_US.UTF8 means to use
> utf8. Quick perusal of system documentation seems to show that "locale
> charmap" prints the current encoding.
This is highly unreliable for many users. In fact it's very likely
that mbox files, saved HTML pages, and the like will be in various
encodings.
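
One way a program might cope with this, sketched in Python under stated assumptions: try strict UTF-8 first, and only fall back to the locale's encoding if that fails. The helper name `decode_guess` and its `fallback` parameter are hypothetical; `locale.getpreferredencoding` is the standard-library call for asking what the locale claims. This is a guess-and-fallback sketch, not a reliable detector.

```python
import locale


def decode_guess(data: bytes, fallback=None):
    """Return (text, encoding_used), preferring strict UTF-8."""
    try:
        # Strict decoding fails on byte sequences that are not valid UTF-8.
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass
    # Assumption: the locale's encoding is the next best guess.
    encoding = fallback or locale.getpreferredencoding(False)
    # errors="replace" keeps going even if the locale guess is wrong too.
    return data.decode(encoding, errors="replace"), encoding
```

For example, `decode_guess(raw, fallback="latin-1")` would use Latin-1 only for data that is not valid UTF-8.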
> Linux distributions are moving toward making UTF8 the default (and
> hopefully someday, the only option).
That will not happen any time soon, because there exist read-only
media in legacy encodings.