[darcs-users] Default binary masks

David Roundy droundy at abridgegame.org
Mon Nov 24 01:04:11 UTC 2003

On Sun, Nov 23, 2003 at 02:58:14PM -0500, Sean E. Russell wrote:
> > files and creating binary patches if either is found.  It probably
> > wouldn't be too hard to create rules for UTF-16 as well, but of course
> > that would require that we have a UTF-16 patch type.  This wouldn't be
> > too hard to do,
> But are you suggesting automatically marking UTF-16 files as binary files?

No, that wouldn't be helpful.  I'm suggesting that (post 1.0) if there is
demand, we could create a patch type describing a change to a UTF-16 text
file (or perhaps a container patch that converted a text-patch into a
UTF-16 patch).

> It is fairly safe to try to determine when a file is binary, but it is
> another thing entirely to try to divine both whether a file is binary AND
> what encoding it is using -- because there are binary files which are
> also valid UTF-16 (for example) files.
> The first question is whether it is possible for darcs to support
> non-ASCII file encodings as non-binary files.  Will darcs be able to
> handle, for example, UTF-8 files as "text" files?  If not, then
> supporting encodings is a moot issue, because there's only ISO-8859-1 and
> everything else is a binary file.

UTF-8 files are perfectly fine text files requiring no extra work to be
supported (except that darcs replace won't work quite as expected on UTF-8
files containing multibyte characters).  UTF-16 requires a bit more care
because (unless I'm mistaken) a single '\n' byte is no longer the newline,
since each character takes two bytes.  I don't know exactly how this is
done, but it definitely seems likely to require extra care when breaking
the file into lines.  We'd also want to be careful never to insert a
single byte in the middle of the file, since that would throw off the
two-byte alignment...
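To make the newline point concrete, here is a small illustration (not
darcs code) of what UTF-16 does to a '\n'; the variable names are made up:

```python
# Illustration (not darcs code) of why a bare '\n' byte is not a safe
# newline marker in UTF-16: every character, newline included, is a
# two-byte code unit.
text = "one\ntwo\n"

le = text.encode("utf-16-le")   # b'o\x00n\x00e\x00\n\x00t\x00w\x00o\x00\n\x00'
be = text.encode("utf-16-be")   # b'\x00o\x00n\x00e\x00\n\x00t\x00w\x00o\x00\n'

# The newline is b'\n\x00' in little-endian and b'\x00\n' in big-endian,
# so splitting the raw bytes on b'\n' leaves stray NUL bytes attached to
# the wrong line.  Decoding first restores sane line-breaking:
lines = be.decode("utf-16-be").split("\n")
assert lines == ["one", "two", ""]
```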

> Another question is whether darcs can change the "type" (text or binary)
> on the fly, if a "text" file suddenly starts containing characters
> associated with binary files.

This is no problem.  In darcs it isn't the files that have a type; the
patches have a type (binary or text).  Admittedly, the patch type is
usually determined by the file name...

> As an aside, 0xFFFE and 0xFEFF are legal characters for 8-bit ASCII files
> to start with, although I'd guess those files are pretty rare.

I thought ASCII was 7 bit?
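(For what it's worth, the 0xFEFF character above is the Unicode byte-order
mark, and its two serializations are what distinguish the UTF-16 byte
orders.  A sketch, with made-up helper names, of what those values look
like as raw bytes:)

```python
# Hypothetical helpers (not darcs code) showing the byte-order marks as
# raw bytes: U+FEFF serializes to b'\xff\xfe' in UTF-16-LE and to
# b'\xfe\xff' in UTF-16-BE.
def bom_of(data):
    if data.startswith(b"\xff\xfe"):
        return "utf-16-le"
    if data.startswith(b"\xfe\xff"):
        return "utf-16-be"
    return None

# Strict ASCII really is 7-bit: every byte is below 0x80, so neither BOM
# byte (0xFE, 0xFF) can occur in a pure ASCII file.
def is_seven_bit(data):
    return all(b < 0x80 for b in data)
```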

> Subversion has a mechanism for plug-in diffing algorithms.  I haven't
> seen it in use, yet, although I've toyed with the idea of writing an XML
> differ plugin for it.  darcs probably doesn't want to go there.
> > Currently it is "based on regexps", which for the defaults means mostly
> > based on file ending, but could also be based on complete filenames.
> > In the general case, users can do whatever they want.
> But darcs isn't going to try to determine, from the file content, what 
> encoding a file is using -- is that correct?

I don't see any reason not to try, as long as it can be overridden.  In
fact, darcs already tries to determine if a file is binary by looking at
its content.
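(A common content-based heuristic for this -- a sketch, not darcs' actual
test -- is simply to look for NUL bytes in an initial chunk of the file:)

```python
# Sketch of a common content-based binary test (not darcs' actual code):
# treat the file as binary if a NUL byte appears in its first block.
def looks_binary(data, probe=8192):
    return b"\x00" in data[:probe]
```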

> > In some cases you can determine the encoding, provided you define the
> > encoding such that there are invalid files.  For example, a valid text
> > file
> An ASCII encoded file can also be a valid UTF-16 file, a valid UTF-8 file
> (in fact, all 7-bit ASCII files are also valid UTF-8 files), a valid
> UNILE-encoded file, and a large number of valid ISO-8859-* files.  Yes,
> there are a couple of characters which, if you're lucky enough that they
> occur in your file, can tell you that a file isn't an ASCII file, but
> you're asking for trouble if you try to auto-detect the encoding.  The
> best you'll be able to do is *rule out* some encodings, because they are
> defined with "magic" byte sequences.  UTF-16 and UNILE are good examples
> of this.  But that, generally, only narrows the field of possible *valid*
> encodings by a small percentage.
> Worse, there are many multi-byte encodings which *can* include your
> "illegal" characters.  If you start accepting multi-byte encodings as
> "text" files (UTF-16, Unile, Shift-JIS, UTF-8, etc.), it becomes
> impossible to tell the difference between binary data and a multi-byte
> encoded text file.

As long as the user can override darcs' guess, I don't see any reason not
to try to guess what the encoding is.  As far as I can see, the only three
possibilities are 8 bit text (which includes UTF-8, since it has the same 8
bit newlines), UTF-16, and binary.  I'm not likely to bother writing a
UTF-16 guesser, since I have no UTF-16 files, but I don't see that it would
be a bad idea.  Except that I don't really like introducing new patch
types, so I probably *would* require some convincing.  But it would be the
new patch type that would be the sticking point, not trying to guess the
file encoding.
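(The three-way guess described above could be sketched like this --
hypothetical code combining the byte-order-mark and NUL-byte observations
from earlier in the thread; all names are made up:)

```python
# Hypothetical three-way classifier along the lines described above:
# 8-bit text (which covers ASCII, ISO-8859-*, and UTF-8, since they all
# share one-byte '\n' newlines), UTF-16, or binary.
def guess_kind(data, probe=8192):
    head = data[:probe]
    if head.startswith(b"\xff\xfe") or head.startswith(b"\xfe\xff"):
        return "utf-16"   # a byte-order mark is strong evidence
    if b"\x00" not in head:
        return "text"     # plausibly 8-bit text
    # NUL bytes but no BOM: possibly BOM-less UTF-16, but without deeper
    # analysis the safe (user-overridable) default is binary.
    return "binary"
```

Since the guess can be overridden, a wrong classification costs the user
only a manual correction.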

David Roundy
