[darcs-users] Default binary masks
droundy at abridgegame.org
Mon Nov 24 01:04:11 UTC 2003
On Sun, Nov 23, 2003 at 02:58:14PM -0500, Sean E. Russell wrote:
> > files and creating binary patches if either is found. It probably
> > wouldn't be too hard to create rules for UTF-16 as well, but of course
> > that would require that we have a UTF-16 patch type. This wouldn't be
> > too hard to do,
> But are you suggesting automatically marking UTF-16 files as binary files?
No, that wouldn't be helpful. I'm suggesting that (post 1.0) if there is
demand, we could create a patch type describing a change to a UTF-16 text
file (or perhaps a container patch that converted a text-patch into a
UTF-16 patch).
> It is fairly safe to try to determine when a file is binary, but it is
> another thing entirely to try to divine both whether a file is binary AND
> what encoding it is using -- because there are binary files which are
> also valid UTF-16 (for example) files.
> The first question is whether it is possible for darcs to support
> non-ASCII file encodings as non-binary files. Will darcs be able to
> handle, for example, UTF-8 files as "text" files? If not, then
> supporting encodings is a moot issue, because there's only ISO-8859-1 and
> everything else is a binary file.
UTF-8 files are perfectly fine text files requiring no extra work to be
supported (except that darcs replace won't work quite as expected on UTF-8
files containing multibyte characters). UTF-16 requires a bit more care
because (unless I'm mistaken) a simple '\n' byte is no longer the newline,
since each character takes two bytes. I don't know exactly how this is
done, but it definitely seems likely to require extra care when breaking
the file into lines.
We'd also want to be careful never to add a single character in the middle
of the file...
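To illustrate the line-breaking concern above, here is a small sketch (not
darcs code, just a demonstration) of why splitting on the single byte 0x0A
mangles UTF-16, and how decoding first avoids it:

```python
# Why byte-level line splitting fails for UTF-16: in UTF-16LE the newline
# is the two-byte sequence 0x0A 0x00, so splitting on b"\n" alone cuts
# the character in half.
text = "one\ntwo"
utf16le = text.encode("utf-16-le")  # b'o\x00n\x00e\x00\n\x00t\x00w\x00o\x00'

# Naive split on the single byte 0x0A leaves a stray NUL at the start of
# the second "line":
naive = utf16le.split(b"\n")
assert naive[1][:1] == b"\x00"

# Safer approach: decode, split on characters, re-encode each line.
lines = [l.encode("utf-16-le") for l in utf16le.decode("utf-16-le").split("\n")]
assert lines[1].decode("utf-16-le") == "two"
```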
> Another question is whether darcs can change the "type" (text or binary)
> on the fly, if a "text" file suddenly starts containing characters
> associated with binary files.
This is no problem. In darcs it isn't the files that have a type, the
patches have a type (binary or text). Admittedly, the patch type is
usually determined by the file name...
> As an aside, 0xFFFE and 0xFEFF are legal characters for 8-bit ASCII files
> to start with, although I'd guess those files are pretty rare.
I thought ASCII was 7 bit?
> Subversion has a mechanism for plug-in diffing algorithms. I haven't
> seen it in use, yet, although I've toyed with the idea of writing an XML
> differ plugin for it. darcs probably doesn't want to go there.
> > Currently it is "based on regexps", which for the defaults means mostly
> > based on file ending, but could also be based on complete filenames.
> > In the general case, users can do whatever they want.
> But darcs isn't going to try to determine, from the file content, what
> encoding a file is using -- is that correct?
I don't see any reason not to try, as long as it can be overridden. In
fact, darcs already tries to determine if a file is binary by looking at
the file name.
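The content-based check mentioned earlier in the thread (looking for NUL
or ^Z bytes and creating binary patches if either is found) could be
sketched like this; the probe size is an assumption of mine, not anything
darcs specifies:

```python
# Hedged sketch of a content heuristic like the one discussed in this
# thread: treat a file as binary if a NUL (0x00) or ^Z (0x1a) byte
# appears in its leading bytes. The 4096-byte probe is an arbitrary
# cutoff chosen for this illustration.
def looks_binary(data: bytes, probe: int = 4096) -> bool:
    head = data[:probe]
    return b"\x00" in head or b"\x1a" in head

assert looks_binary(b"PK\x03\x04\x00rest-of-archive")  # NUL in header
assert not looks_binary("plain text\n".encode("utf-8"))
```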
> > In some cases you can determine the encoding, provided you define the
> > encoding such that there are invalid files. For example, a valid text
> > file
> An ASCII encoded file can also be a valid UTF-16 file, a valid UTF-8 file
> (in fact, all 7-bit ASCII files are also valid UTF-8 files), a valid
> UNILE-encoded file, and a large number of valid ISO-8859-* files. Yes,
> there are a couple of characters which, if you're lucky enough that they
> occur in your file, can tell you that a file isn't an ASCII file, but
> you're asking for trouble if you try to auto-detect the encoding. The
> best you'll be able to do is *rule out* some encodings, because they are
> defined with "magic" byte sequences. UTF-16 and UNILE are good examples
> of this. But that, generally, only narrows the field of possible *valid*
> encodings by a small percentage.
> Worse, there are many multi-byte encodings which *can* include your
> "illegal" characters. If you start accepting multi-byte encodings as
> "text" files (UTF-16, Unile, Shift-JIS, UTF-8, etc.), it becomes
> impossible to tell the difference between binary data and a multi-byte
> encoded text file.
As long as the user can override darcs' guess, I don't see any reason not
to try to guess what the encoding is. As far as I can see, the only three
possibilities are 8 bit text (which includes UTF-8, since it has the same 8
bit newlines), UTF-16 and binary. I'm not likely to bother writing a
UTF-16 guesser myself, since I have no UTF-16 files, but I don't see that
it would be a bad idea. Except that I don't really like introducing new
patch types, so I probably *would* require some convincing. But it would
be the new patch type that would be the sticking point, not trying to
guess the encoding.
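A hypothetical guesser covering those three possibilities might look like
the following (my own sketch, not darcs behavior): use the UTF-16 byte
order marks to recognize UTF-16, fall back to a NUL scan for binary, and
treat everything else as 8-bit text (which covers ASCII, ISO-8859-*, and
UTF-8 alike):

```python
# Three-way guess: "utf-16" when a BOM (0xFF 0xFE or 0xFE 0xFF) leads the
# file, "binary" when a NUL byte appears without one, otherwise "text".
# The 4096-byte scan window is an assumption for this sketch.
def guess_kind(data: bytes) -> str:
    if data[:2] in (b"\xff\xfe", b"\xfe\xff"):
        return "utf-16"
    if b"\x00" in data[:4096]:
        return "binary"
    return "text"

assert guess_kind("hi\n".encode("utf-16")) == "utf-16"  # codec emits a BOM
assert guess_kind(b"\x7fELF\x00") == "binary"
assert guess_kind("héllo\n".encode("utf-8")) == "text"
```

As the thread notes, any such guess can only rule encodings out, never
prove one, so the user must be able to override it.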