[darcs-devel] DARCS for Windows international development
Wim Lewis
wiml at hhhh.org
Thu May 31 16:07:54 PDT 2007
On May 31, 2007, at 3:32 PM, Paul Schauble wrote:
> UTF-8 files are only smaller if the text is English only.
(Nit: Other European languages also do well in UTF-8, since they
usually have only scattered non-ASCII characters. I agree with your
point re Chinese, though!)
> Absent a BOM, is there another convention on Linux that allows you to
> identify a UTF-8 file? Or does the program just have to know in advance
> that it's reading UTF-8?
The latter. The use of ZWNBSP (U+FEFF, the byte order mark) as a magic
number for UTF-8 files is a Microsoftism, as far as I know. I don't
think I've seen it on other platforms.
> I ask because the file reading routine I use examines the file for a BOM
> and will interchangeably read ANSI in the system default code page,
> UTF-8, UTF-16LE, and UTF-16BE. I am considering using a similar method
> for darcs to identify the type of a file. But this only works if the
> Linux and Unix conventions call for a BOM on Unicode files.
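For concreteness, the kind of BOM check described in the quoted text might look roughly like the sketch below. This is only an illustration, not the routine Paul uses and not anything in darcs: the function name `sniff_bom` and the `cp1252` fallback (standing in for "the system default code page") are assumptions.

```python
import codecs

# BOM byte sequences and the encoding each one announces.
# Only the three encodings named in the quoted text are covered.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
]

def sniff_bom(data, default="cp1252"):
    """Return (encoding, bom_length) for the leading bytes of a file.

    Hypothetical helper: 'default' stands in for the system code page
    and is returned, with length 0, when no BOM is found.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return default, 0
```

A caller would then decode with something like `data[n:].decode(enc)`. Note that a file with no BOM always falls through to the default, which is exactly the case that matters on Linux and Unix.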
The question of what kind of character encoding a file uses is kind
of like the question of what language the contents are written in
(Spanish, C++, etc.). One reason that UTF-8 is popular is that many
or most utilities can remain agnostic about the character encoding of
the text they're handling, without breaking anything.
In practice, if you know that something is either UTF-8 or UTF-16, the
two are easy to distinguish, but I dislike programs that run more
complicated guessing algorithms over the text. Every now and then they
get it wrong...
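
As a rough sketch of the two-way case only (the input is already known to be either UTF-8 or UTF-16), something like the following would usually do. The function name `looks_like_utf8` is made up for illustration, and the NUL-byte test is an assumption about ordinary text files, not a rule from any standard.

```python
def looks_like_utf8(data):
    """Guess whether bytes known to be UTF-8 or UTF-16 are UTF-8.

    Hypothetical helper. Returns False to mean "treat as UTF-16".
    """
    # UTF-16 encodes every ASCII character with a zero byte, while
    # ordinary UTF-8 text does not contain NUL (an assumption).
    if b"\x00" in data:
        return False
    # Most UTF-16 byte sequences are not well-formed UTF-8.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

Even this can be fooled: a short run of CJK characters in UTF-16 whose bytes all happen to fall in the ASCII range passes both checks, which is the sort of occasional wrong answer I mean.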