[darcs-devel] DARCS for Windows international development

Juliusz Chroboczek Juliusz.Chroboczek at pps.jussieu.fr
Fri Jun 22 16:29:26 PDT 2007


{-# FLAME on #-}

> I thought the layout for UTF-8 is always On Windows, Unicode files,
> both UTF-8 and UTF-16LE, will contain a leading Byte Order Mark that
> also identifies the file format.

Which is silly.

> What good is a BOM in UTF-8?

None at all.  Just like in UTF-16.

> It would be reasonable for darcs to recognize Unicode files by the
> BOM and handle files without it as Ansi text or binary.

No, it wouldn't.  The BOM is a gross hack the use of which should not
be encouraged.  If Darcs notices a BOM, it should execute ``rm -rf $HOME''
in the background while logging a message at level LOG_CRIT, reducing
the user's quota by 1000MB, and sending offensive messages to
bill at microsoft.com and president at whitehouse.gov.

UTF-8 has a highly stylised form that can be reliably recognised, and
even UTF-16 (both variants) can be recognised with reasonable
certainty without using a signature.  Emacs has been doing it for
ages; I've never had a mis-recognised file.

If something looks like a text file (no NULs), you do the following:

  - scan the first 4kB or so of the file for bytes >= 128.  If there
    are none, it's ASCII;
  - otherwise, try to decode the first 4kB as UTF-8.  If it's
    successful, it's UTF-8.  0% false positive rate, unless someone
    names his variables « ê ».
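A minimal Python sketch of the two steps above, for illustration only. The names `classify_text` and `SAMPLE_SIZE` and the returned labels are hypothetical. The sketch assumes that a multi-byte sequence cut off at the 4kB boundary should not count as a decoding failure, so it uses an incremental decoder for the sample.

```python
import codecs

SAMPLE_SIZE = 4096  # "the first 4kB or so"; the exact size is an assumption


def classify_text(data: bytes) -> str:
    """Classify a NUL-free byte string as ASCII, UTF-8, or unknown 8-bit."""
    head = data[:SAMPLE_SIZE]
    if b"\x00" in head:
        # Contains NULs: does not look like text, so this check does not apply.
        return "not-text"
    if all(byte < 128 for byte in head):
        # No byte >= 128 in the sample: plain ASCII.
        return "ascii"
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        # final=False tolerates a character truncated at the sample boundary;
        # final=True is used only when the sample is the whole input.
        decoder.decode(head, final=len(data) <= SAMPLE_SIZE)
    except UnicodeDecodeError:
        # Some other 8-bit encoding (Latin-1, for example).
        return "unknown-8bit"
    return "utf-8"
```

For example, `classify_text("ê".encode("utf-8"))` gives `"utf-8"`, while the Latin-1 encoding of the same character is rejected because a lone 0xEA byte is not a complete UTF-8 sequence.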

If something looks like it's binary (loads of NULs), compute the
number of NULs and of NLs in even and odd positions in the first 4kB.
Then do

  if nul-odd = 0 then
      if nl-odd > 1% and nul-even > 2% then
          it's UTF-16BE           -- yuck
      else
          it's binary
  else if nul-even = 0 then
      if nl-even > 1% and nul-odd > 2% then
          it's UTF-16LE           -- even more yuck
      else
          it's binary
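The same heuristic as a Python sketch, under a few assumptions that are not spelled out above: positions are counted from 0, the percentages are taken relative to the sample length, the LE branch checks NULs in odd positions (mirroring the BE branch), and a sample with NULs in both even and odd positions counts as binary. The name `classify_binary` and the returned labels are hypothetical.

```python
SAMPLE_SIZE = 4096  # "the first 4kB"


def classify_binary(data: bytes) -> str:
    """Guess UTF-16BE, UTF-16LE, or binary from NUL and NL positions."""
    head = data[:SAMPLE_SIZE]
    if not head:
        return "binary"
    even, odd = head[0::2], head[1::2]  # bytes at even / odd offsets
    nul_even, nul_odd = even.count(b"\x00"), odd.count(b"\x00")
    nl_even, nl_odd = even.count(b"\n"), odd.count(b"\n")
    n = len(head)

    if nul_odd == 0:
        # ASCII-range UTF-16BE: high (zero) byte first, newline byte second.
        if nl_odd > 0.01 * n and nul_even > 0.02 * n:
            return "utf-16be"
        return "binary"
    if nul_even == 0:
        # ASCII-range UTF-16LE: newline byte first, high (zero) byte second.
        if nl_even > 0.01 * n and nul_odd > 0.02 * n:
            return "utf-16le"
        return "binary"
    # NULs in both even and odd positions: treated as binary (assumption).
    return "binary"
```

Feeding it `"hello\nworld\n".encode("utf-16-be")` yields `"utf-16be"`, and the `utf-16-le` encoding of the same string yields `"utf-16le"`; neither sample carries a BOM.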

Reliably recognising the PDP-endian form of UCS-4 (or whatever else
the Unicode consortium decide to shove down our collective throat in
the next revision of the standard) is left as an exercise for the
(very) interested reader.

                                        Juliusz

