[darcs-users] Default binary masks

Alex Shinn foof at synthcode.com
Tue Nov 25 04:01:09 UTC 2003


At Sun, 23 Nov 2003 20:04:11 -0500, David Roundy wrote:
> 
> As long as the user can override darcs' guess, I don't see any reason not
> to try to guess what the encoding is.  As far as I can see, the only three
> possibilities are 8 bit text (which includes UTF-8, since it has the same 8
> bit newlines), UTF-16 and binary.  I'm not likely to bother encoding a
> UTF-16 guesser, since I have no UTF-16 files, but I don't see that it would
> be a bad idea.  Except that I don't really like introducing new patch
> types, so I probably *would* require some convincing.  But it would be the
> new patch type that would be the sticking point, not trying to guess the
> file encoding.

You can group encodings into a few general types:

  1) ASCII backwards-compatible
     * ASCII bytes in the encoding only ever represent ASCII
       characters, and all other characters are encoded as
       combinations of one or more bytes in the 0x80-0xFF range.
       This includes all single-byte encodings such as ISO-8859-*,
       plus UTF-8 and EUC-*.

  2) ASCII "friendly"
     * A pure ASCII string is rendered as itself in the encoding, and
       extended characters contain neither the NULL byte nor the
       ASCII bytes the diffing algorithm relies on, such as newline.
       This includes Shift_JIS.

  3) Non-NULL
     * Only guaranteed not to contain the NULL byte.  Includes the
       7-bit ISO-2022-* encodings and UTF-7.

  4) NULL-encoded text
     * UTF-16{le,be} and UTF-32{le,be}.

  5) Binary Data
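
To make this concrete, here is the character U+65E5 (the kanji for
"day") as raw bytes under one representative encoding from each of
categories 1-4.  This is a Haskell illustration of my own, not
anything in darcs:

    import Data.Word (Word8)

    utf8, shiftJis, iso2022jp, utf16be :: [Word8]
    utf8      = [0xE6, 0x97, 0xA5]              -- 1: high bytes only
    shiftJis  = [0x93, 0xFA]                    -- 2: no NULL, no newline byte
    iso2022jp = [0x1B, 0x24, 0x42, 0x46, 0x7C]  -- 3: ESC $ B F | ; 7-bit, reuses ASCII bytes
    utf16be   = [0x65, 0xE5]                    -- 4: pure ASCII gains NULLs, e.g. 'a' = [0x00,0x61]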

Categories 1 and 2 can be handled as-is by any ASCII-oriented diff
algorithm.  Category 3 can also be handled as-is, but may suffer poor
performance if newlines are not represented as themselves.  Category
3 also has a slight problem with newline conversion between
platforms, but that is a tricky area regardless.  Ideally the diff
algorithm should preserve exact byte differences across all
platforms, such as by auto-detecting the newline convention, so this
should not be an issue.  That is, if I check in a newline-delimited
file from a Unix system, someone who checks it out and applies
patches on a Mac should not have it either converted to or patched
with carriage returns.
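
As a rough sketch of that kind of newline auto-detection (assuming a
ByteString-style API; the names are hypothetical, this is not darcs
code):

    import qualified Data.ByteString as B

    data NLConv = LF | CRLF | CR deriving (Show, Eq)

    -- Count each convention and keep the dominant one, so patches
    -- are applied with the file's own bytes, not the platform's.
    detectNL :: B.ByteString -> NLConv
    detectNL bs = pick (count 0 0 0 (B.unpack bs))
      where
        count lf crlf cr (0x0D:0x0A:rest) = count lf (crlf + 1) cr rest
        count lf crlf cr (0x0D:rest)      = count lf crlf (cr + 1) rest
        count lf crlf cr (0x0A:rest)      = count (lf + 1) crlf cr rest
        count lf crlf cr (_:rest)         = count lf crlf cr rest
        count lf crlf cr []               = (lf, crlf, cr)
        pick (lf, crlf, cr)
          | crlf > lf && crlf >= cr = CRLF
          | cr > lf                 = CR
          | otherwise               = LF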

Categories 1-3 are easily identified by the absence of the NULL byte
in the file.  Category 4 really differs from 3 only in that we can't
detect it as easily.  In UTF-16 a newline is either 0x00,0x0A or
0x0A,0x00, so splitting on the newline byte halves the character, but
the halves are reunited when the file is patched back together.  So
long as the diff algorithm remains content-agnostic, all it really
needs is that we can reliably chunk and recombine the text.
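
In code, the NULL test and the chunking guarantee look something like
this (again a sketch with made-up names, not darcs code):

    import qualified Data.ByteString as B

    -- Categories 1-3: no NULL byte anywhere in the file.
    looksLikeText :: B.ByteString -> Bool
    looksLikeText = B.notElem 0x00

    -- Splitting UTF-16 on 0x0A halves the two-byte newline
    -- character, but splitting and re-joining on the same byte is
    -- lossless, which is all a content-agnostic diff needs.
    chunksRecombine :: B.ByteString -> Bool
    chunksRecombine bs =
        B.intercalate (B.singleton 0x0A) (B.split 0x0A bs) == bs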

Category 4 is very often detectable by the byte-order mark (0xFEFF,
appearing as the bytes 0xFF,0xFE or 0xFE,0xFF) at the start of the
file, and every editor I know of inserts that magic.  On seeing it we
can fall back on the same algorithm used for 1-3.  If we guessed
wrong and it's really a single-byte 8-bit encoding then there's no
difference, and if it's really a binary file we suffer only poor
diffing performance.  Emacs uses this approach and auto-detects all
the common encodings just fine.
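
A sketch of the magic test (not darcs' actual guesser; note the
UTF-32 marks must be checked before the UTF-16 ones, since
0xFF,0xFE,0x00,0x00 begins with the UTF-16LE mark):

    import qualified Data.ByteString as B

    data Guess = UTF32LE | UTF32BE | UTF16LE | UTF16BE | Other
      deriving Show

    guessBOM :: B.ByteString -> Guess
    guessBOM bs
      | pre [0xFF, 0xFE, 0x00, 0x00] = UTF32LE
      | pre [0x00, 0x00, 0xFE, 0xFF] = UTF32BE
      | pre [0xFF, 0xFE]             = UTF16LE
      | pre [0xFE, 0xFF]             = UTF16BE
      | otherwise                    = Other
      where pre ws = B.pack ws `B.isPrefixOf` bs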

-- 
Alex