[darcs-users] Default binary masks
Sean E. Russell
ser at germane-software.com
Sun Nov 23 19:58:14 UTC 2003
[Sorry about losing the thread. I hit the wrong key and deleted the message,
and had to recover the email from the ML archives]
> files and creating binary patches if either is found. It probably wouldn't
> be too hard to create rules for UTF-16 as well, but of course that would
> require that we have a UTF-16 patch type. This wouldn't be too hard to do,
But are you suggesting automatically marking UTF-16 files as binary files?
It is fairly safe to try to determine when a file is binary, but it is another
thing entirely to try to divine both whether a file is binary AND what
encoding it is using -- because there are binary files which are also valid
UTF-16 (for example) files.
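To make the asymmetry concrete, here is a minimal sketch (my own illustration, not darcs code) of the common "NUL byte" heuristic for binary detection, and how it collides with UTF-16:

```python
def looks_binary(data):
    """Heuristic: treat data as binary if it contains a NUL byte.

    This is a widespread convention, not darcs's actual rule, and it
    misfires on UTF-16 text, where every ASCII-range character is
    encoded with a NUL byte in it."""
    return b"\x00" in data

# Plain ASCII text: no NUL bytes, classified as text.
assert not looks_binary(b"hello world\n")
# The same text encoded as UTF-16LE is full of NULs,
# so the heuristic misclassifies it as binary.
assert looks_binary("hello world\n".encode("utf-16-le"))
```

So a simple binary check is workable, but it cannot double as an encoding detector.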
The first question is whether it is possible for darcs to support non-ASCII
file encodings as non-binary files. Will darcs be able to handle, for
example, UTF-8 files as "text" files? If not, then supporting encodings is a
moot issue, because there's only ISO-8859-1 and everything else is a binary file.
Another question is whether darcs can change the "type" (text or binary) on
the fly, if a "text" file suddenly starts containing characters associated
with binary files.
As an aside, the byte sequences 0xFFFE and 0xFEFF are legal at the start of
8-bit extended-ASCII files, although I'd guess such files are pretty rare.
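A quick sketch of that aside (my own illustration, not darcs code): a UTF-16 BOM check only *suggests* an encoding, because the BOM bytes are themselves legal content in an 8-bit encoding.

```python
# Illustration only: finding the bytes 0xFF 0xFE (or 0xFE 0xFF) at the
# start of a file suggests UTF-16, but the same two bytes are perfectly
# legal at the start of a Latin-1 file.
def bom_suggests_utf16(data):
    return data[:2] in (b"\xff\xfe", b"\xfe\xff")

utf16_text = "hi".encode("utf-16")        # starts with a BOM
latin1_text = "ÿþ etc.".encode("latin-1") # same leading bytes, 8-bit text

assert bom_suggests_utf16(utf16_text)
assert bom_suggests_utf16(latin1_text)    # false positive: ambiguity
```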
Subversion has a mechanism for plug-in diffing algorithms. I haven't seen it
in use, yet, although I've toyed with the idea of writing an XML differ
plugin for it. darcs probably doesn't want to go there.
> Currently it is "based on regexps", which for the defaults means mostly
> based on file ending, but could also be based on complete filenames. In
> the general case, users can do whatever they want.
But darcs isn't going to try to determine, from the file content, what
encoding a file is using -- is that correct?
> In some cases you can determine the encoding, provided you define the
> encoding such that there are invalid files. For example, a valid text file
An ASCII encoded file can also be a valid UTF-16 file, a valid UTF-8 file (in
fact, all 7-bit ASCII files are also valid UTF-8 files), a valid
UNILE-encoded file, and a large number of valid ISO-8859-* files. Yes, there
are a couple of characters which, if you're lucky enough that they occur in
your file, can tell you that a file isn't an ASCII file, but you're asking
for trouble if you try to auto-detect the encoding. The best you'll be able
to do is *rule out* some encodings, because they are defined with "magic"
byte sequences. UTF-16 and UNILE are good examples of this. But that,
generally, only narrows the field of possible *valid* encodings by a small amount.
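A minimal sketch of that "rule out, never prove" logic (my own illustration, not darcs code): a strict decode failure proves a file is *not* in that encoding, but a successful decode proves nothing, since many byte strings are valid in several encodings at once.

```python
# Illustration: strict decoding can only rule encodings out.
def possible_encodings(data, candidates=("ascii", "utf-8", "utf-16", "latin-1")):
    """Return the candidate encodings that do NOT reject the data."""
    ok = []
    for enc in candidates:
        try:
            data.decode(enc)
            ok.append(enc)
        except UnicodeDecodeError:
            pass
    return ok

# Plain ASCII bytes survive every candidate: nothing is proven.
assert possible_encodings(b"hell") == ["ascii", "utf-8", "utf-16", "latin-1"]
# A lone 0xFF byte rules out ASCII, UTF-8, and UTF-16 (odd length),
# but Latin-1 accepts any byte, so the field is only narrowed.
assert possible_encodings(b"\xff") == ["latin-1"]
```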
Worse, there are many multi-byte encodings which *can* include your "illegal"
characters. If you start accepting multi-byte encodings as "text" files
(UTF-16, UNILE, Shift-JIS, UTF-8, etc.), it becomes impossible to tell the
difference between binary data and a multi-byte encoded text file.
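For instance (again my own illustration, not darcs code), nearly any even-length run of bytes is a valid UTF-16LE string, since only unpaired surrogates are rejected, so "decodes cleanly as UTF-16" is useless evidence that a file is text:

```python
# Illustration: bytes that look nothing like text still decode as UTF-16LE.
blob = bytes(range(64))          # plausibly binary data
text = blob.decode("utf-16-le")  # succeeds anyway: 32 valid code units
assert len(text) == 32
```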
### http://www.germane-software.com/~ser jabber.com:ser ICQ:83578737
### GPG: http://www.germane-software.com/~ser/Security/ser_public.gpg