[darcs-devel] DARCS for Windows international development

Wim Lewis wiml at hhhh.org
Thu May 31 16:07:54 PDT 2007


On May 31, 2007, at 3:32 PM, Paul Schauble wrote:
> UTF-8 files are only smaller if the text is English only.

(Nit: Other European languages also do well in UTF-8, since they
usually have only scattered non-ASCII characters. I agree with your
point re Chinese, though!)
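
To put rough numbers on that, here is a small sketch that compares
encoded sizes. The sample sentences and the helper name encoded_sizes
are made up for illustration, and UTF-16 is measured without a BOM.

```python
# Illustrative sample strings (assumptions, not taken from any real file).
samples = {
    "English": "The quick brown fox jumps over the lazy dog",
    "German": "Zwölf Boxkämpfer jagen Viktor quer über den großen Sylter Deich",
    "Chinese": "我能吞下玻璃而不伤身体",
}


def encoded_sizes(text):
    """Return (utf8_bytes, utf16_bytes) for text, with no BOM in either."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


for name, text in samples.items():
    u8, u16 = encoded_sizes(text)
    print(f"{name}: {len(text)} chars, UTF-8 {u8} bytes, UTF-16 {u16} bytes")
```

ASCII characters take one byte in UTF-8 and two in UTF-16, accented
Latin letters take two in both, and most CJK characters take three in
UTF-8 but two in UTF-16, which is where the Chinese case loses.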

> Absent a BOM, is there another convention on Linux that allows you to
> identify a UTF-8 file? Or does the program just have to know in
> advance that it's reading UTF-8?

The latter. The use of ZWNBSP (U+FEFF, the same character that serves
as the BOM) as a magic number for UTF-8 files is a Microsoftism, as
far as I know. I don't think I've seen it on other platforms.
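
For reference, checking for that signature is just a prefix
comparison. A minimal sketch follows; the function name sniff_bom and
the table BOMS are my own, and it deliberately ignores UTF-32, whose
little-endian BOM starts with the same two bytes as the UTF-16LE one.

```python
import codecs

# Byte-order marks for the encodings discussed in this thread.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no BOM is present."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0
```

A result of (None, 0) only means there is no signature. It says
nothing about what the encoding actually is.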

> I ask because the file reading routine I use examines the file for
> a BOM and will interchangeably read ANSI in the system default code
> page, UTF-8, UTF-16LE, and UTF-16BE. I am considering using a similar
> method for darcs to identify the type of a file. But this only works
> if the Linux and Unix conventions call for a BOM on Unicode files.

The question of what character encoding a file uses is a lot like
the question of what language its contents are written in (Spanish,
C++, etc.). One reason that UTF-8 is popular is that many, if not
most, utilities can remain agnostic about the character encoding of
the text they're handling, without breaking anything.
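
As an illustration of that agnosticism, here is a sketch of a
byte-oriented field splitter that never decodes its input. The
function name split_fields and the sample record are invented for the
example. It works on UTF-8 because every byte of a multi-byte
sequence is 0x80 or above, so an ASCII separator can never match
inside a character.

```python
def split_fields(line: bytes, sep: bytes = b":"):
    """Split a raw byte string on an ASCII separator without decoding it."""
    return line.split(sep)


# Hypothetical record mixing accented Latin, Japanese, and plain ASCII.
record = "naïve:日本語:plain".encode("utf-8")
fields = split_fields(record)

# Each field is still valid UTF-8 and decodes cleanly on its own.
decoded = [field.decode("utf-8") for field in fields]
print(decoded)
```

The same trick does not carry over to UTF-16, where the bytes of an
ASCII separator can show up as half of some other character.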

In practice, if you know that something is either UTF-8 or UTF-16,
it's easy to distinguish, but I dislike programs that run more
complicated guessing algorithms over the text. Every now and then
they get it wrong...
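
One way the two-way check could go, as a sketch only: the function
name utf8_or_utf16 is made up, and it assumes the input really is one
of the two. UTF-16 data usually either fails strict UTF-8 decoding or
contains NUL bytes, which ordinary text files in UTF-8 do not.

```python
def utf8_or_utf16(data: bytes) -> str:
    """Guess between UTF-8 and UTF-16, assuming data is one of the two."""
    # A UTF-16 BOM settles it immediately.
    if data.startswith((b"\xff\xfe", b"\xfe\xff")):
        return "utf-16"
    try:
        data.decode("utf-8")  # strict decoding by default
    except UnicodeDecodeError:
        return "utf-16"
    # ASCII-range characters in UTF-16 carry a NUL byte each.
    # Caveat: short UTF-16 text with no NULs can still pass as UTF-8.
    return "utf-16" if b"\x00" in data else "utf-8"
```

Even this small check has a blind spot, as the comment notes, which
is rather the point about guessing.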



