[darcs-users] Default binary masks

David Roundy droundy at abridgegame.org
Mon Nov 24 01:04:11 UTC 2003


On Sun, Nov 23, 2003 at 02:58:14PM -0500, Sean E. Russell wrote:
> > files and creating binary patches if either is found.  It probably
> > wouldn't be too hard to create rules for UTF-16 as well, but of course
> > that would require that we have a UTF-16 patch type.  This wouldn't be
> > too hard to do,
> 
> But are you suggesting automatically marking UTF-16 files as binary files?

No, that wouldn't be helpful.  I'm suggesting that (post 1.0) if there is
demand, we could create a patch type describing a change to a UTF-16 text
file (or perhaps a container patch that converted a text-patch into a
UTF-16 patch).
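
Just to illustrate what I mean (these are hypothetical names, not darcs'
actual patch representation), the container idea could look something like:

import qualified Data.ByteString as B

-- line number, old lines, new lines (purely illustrative)
type Hunk = (Int, [String], [String])

data Patch
  = TextPatch   [Hunk]                    -- ordinary line-based hunks
  | BinaryPatch B.ByteString B.ByteString -- old and new file contents
  | Utf16Patch  Patch                     -- decode UTF-16, apply the
                                          -- inner patch, re-encode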

> It is fairly safe to try to determine when a file is binary, but it is
> another thing entirely to try to divine both whether a file is binary AND
> what encoding it is using -- because there are binary files which are
> also valid UTF-16 (for example) files.
> 
> The first question is whether it is possible for darcs to support
> non-ASCII file encodings as non-binary files.  Will darcs be able to
> handle, for example, UTF-8 files as "text" files?  If not, then
> supporting encodings is a moot issue, because there's only ISO-8859-1 and
> everything else is a binary file.

UTF-8 files are perfectly fine text files requiring no extra work to be
supported (except that darcs replace won't work quite as expected on UTF-8
files containing multibyte characters).  UTF-16 requires a bit more care
because (unless I'm mistaken) a simple '\n' is no longer the newline, since
each character takes two bytes.  I don't know exactly how it is handled,
but it definitely seems likely to require extra care when breaking the file
into lines.  We'd also want to be careful never to insert a single byte in
the middle of the file, since that would throw off the two-byte alignment
of everything after it...
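
For instance, in UTF-16LE a newline is the byte pair 0x0A 0x00 at an even
offset; a lone 0x0A byte at an odd offset is just part of some other
character.  A rough, untested sketch of line-breaking under that assumption
(ignoring byte-order marks and big-endian files):

import qualified Data.ByteString as B

-- Split a UTF-16LE byte stream into lines, scanning two bytes at a
-- time so a 0x0A inside a multibyte character isn't taken for '\n'.
linesUtf16le :: B.ByteString -> [B.ByteString]
linesUtf16le bs
  | B.null bs = []
  | otherwise = case findNewline 0 of
      Nothing -> [bs]
      Just i  -> B.take i bs : linesUtf16le (B.drop (i + 2) bs)
  where
    findNewline i
      | i + 1 >= B.length bs = Nothing
      | B.index bs i == 0x0A && B.index bs (i + 1) == 0x00 = Just i
      | otherwise = findNewline (i + 2)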

> Another question is whether darcs can change the "type" (text or binary)
> on the fly, if a "text" file suddenly starts containing characters
> associated with binary files.

This is no problem.  In darcs it isn't the files that have a type; it's the
patches that have a type (binary or text).  Admittedly, the patch type is
usually determined by the file name...
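
To give the flavor of it (a simplified stand-in for the real regexp
machinery, with a made-up list of suffixes), the default masks amount to a
test along these lines:

import Data.Char (toLower)
import Data.List (isSuffixOf)

-- True if the file name suggests the patch should be binary.
binaryBySuffix :: FilePath -> Bool
binaryBySuffix f = any (`isSuffixOf` map toLower f)
                       [".png", ".jpg", ".gz", ".tar", ".exe"]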

> As an aside, 0xFFFE and 0xFEFF are legal characters for 8-bit ASCII files
> to start with, although I'd guess those files are pretty rare.

I thought ASCII was 7 bit?

> Subversion has a mechanism for plug-in diffing algorithms.  I haven't
> seen it in use yet, although I've toyed with the idea of writing an XML
> differ plugin for it.  darcs probably doesn't want to go there.
> 
> > Currently it is "based on regexps", which for the defaults means mostly
> > based on file ending, but could also be based on complete filenames.
> > In the general case, users can do whatever they want.
> 
> But darcs isn't going to try to determine, from the file content, what 
> encoding a file is using -- is that correct?

I don't see any reason not to try, as long as it can be overridden.  In
fact, darcs already tries to determine if a file is binary by looking at
its content.
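
The content check is nothing fancy.  Roughly speaking (this is a sketch of
the idea, not the actual darcs code), it comes down to something like:

import qualified Data.ByteString as B

-- Call a file binary if its first chunk contains a NUL byte, which
-- essentially never shows up in 8-bit text.
looksBinary :: B.ByteString -> Bool
looksBinary = B.elem 0x00 . B.take 4096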

> > In some cases you can determine the encoding, provided you define the
> > encoding such that there are invalid files.  For example, a valid text
> > file
> 
> An ASCII encoded file can also be a valid UTF-16 file, a valid UTF-8 file
> (in fact, all 7-bit ASCII files are also valid UTF-8 files), a valid
> UNILE-encoded file, and a large number of valid ISO-8859-* files.  Yes,
> there are a couple of characters which, if you're lucky enough that they
> occur in your file, can tell you that a file isn't an ASCII file, but
> you're asking for trouble if you try to auto-detect the encoding.  The
> best you'll be able to do is *rule out* some encodings, because they are
> defined with "magic" byte sequences.  UTF-16 and UNILE are good examples
> of this.  But that, generally, only narrows the field of possible *valid*
> encodings by a small percentage.
> 
> Worse, there are many multi-byte encodings which *can* include your
> "illegal" characters.  If you start accepting multi-byte encodings as
> "text" files (UTF-16, Unile, Shift-JIS, UTF-8, etc.), it becomes
> impossible to tell the difference between binary data and a multi-byte
> encoded text file.

As long as the user can override darcs' guess, I don't see any reason not
to try to guess what the encoding is.  As far as I can see, the only three
possibilities are 8-bit text (which includes UTF-8, since it uses the same
8-bit newlines), UTF-16, and binary.  I'm not likely to bother writing a
UTF-16 guesser, since I have no UTF-16 files, but I don't see that it would
be a bad idea.  Except that I don't really like introducing new patch
types, so I probably *would* require some convincing.  But it would be the
new patch type that would be the sticking point, not trying to guess the
file encoding.
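
If someone did want to write such a guesser, the whole three-way decision
could be as simple as this (untested, and the byte-order-mark assumption is
exactly the sort of thing the user would need to be able to override):

import qualified Data.ByteString as B

data Guess = EightBitText | Utf16 | Binary
  deriving Show

guessType :: B.ByteString -> Guess
guessType bs
  | B.take 2 bs == B.pack [0xFF, 0xFE] = Utf16   -- little-endian BOM
  | B.take 2 bs == B.pack [0xFE, 0xFF] = Utf16   -- big-endian BOM
  | B.elem 0x00 (B.take 4096 bs)       = Binary  -- NUL byte => binary
  | otherwise                          = EightBitText
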
-- 
David Roundy
http://www.abridgegame.org



