[darcs-users] Default binary masks

Sean E. Russell ser at germane-software.com
Sun Nov 23 19:58:14 UTC 2003


[Sorry about losing the thread.  I hit the wrong key and deleted the message, 
and had to recover the email from the ML archives]

> files and creating binary patches if either is found.  It probably wouldn't
> be too hard to create rules for UTF-16 as well, but of course that would
> require that we have a UTF-16 patch type.  This wouldn't be too hard to do,

But are you suggesting automatically marking UTF-16 files as binary files?

It is fairly safe to try to determine when a file is binary, but it is another 
thing entirely to try to divine both whether a file is binary AND what 
encoding it is using -- because there are binary files which are also valid 
UTF-16 (for example) files.
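
To make that concrete, here is a minimal sketch (in Python, and emphatically not darcs's actual logic) of the common NUL-byte heuristic for "is this binary?", and of how UTF-16 text defeats it:

```python
# Sketch of the usual "is it binary?" convention: scan a prefix of the
# file for NUL bytes.  This is a common heuristic, not darcs's rule.
# It breaks down for UTF-16, because every ASCII character encoded as
# UTF-16 contains a NUL byte.

def looks_binary(data: bytes, prefix: int = 8192) -> bool:
    """Guess 'binary' if the leading bytes contain a NUL."""
    return b"\x00" in data[:prefix]

ascii_text = "hello\n".encode("ascii")
utf16_text = "hello\n".encode("utf-16-le")

print(looks_binary(ascii_text))  # ASCII text passes as text
print(looks_binary(utf16_text))  # UTF-16 text is misclassified as binary
```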

The first question is whether it is possible for darcs to support non-ASCII 
file encodings as non-binary files.  Will darcs be able to handle, for 
example, UTF-8 files as "text" files?  If not, then supporting encodings is a 
moot issue: "text" effectively means ISO-8859-1 (or plain ASCII), and 
everything else is a binary file.

Another question is whether darcs can change the "type" (text or binary) on 
the fly, if a "text" file suddenly starts containing characters associated 
with binary files.

As an aside, the byte sequences 0xFF 0xFE and 0xFE 0xFF (the UTF-16 byte-order 
marks) are perfectly legal at the start of an 8-bit extended-ASCII file, 
although I'd guess such files are pretty rare.

Subversion has a mechanism for plug-in diffing algorithms.  I haven't seen it 
in use, yet, although I've toyed with the idea of writing an XML differ 
plugin for it.  darcs probably doesn't want to go there.

> Currently it is "based on regexps", which for the defaults means mostly
> based on file ending, but could also be based on complete filenames.  In
> the general case, users can do whatever they want.
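
As I understand it, the regexp approach amounts to something like this sketch (the masks here are illustrative, not darcs's actual defaults): the decision is made from the file *name*, never the contents:

```python
import re

# Sketch of filename-based binary detection, as described in the quote:
# a list of regexps (illustrative, NOT darcs's actual defaults) matched
# against the file name, not its contents.

BINARY_MASKS = [r"\.png$", r"\.gz$", r"\.jpe?g$"]  # hypothetical defaults

def is_binary_name(name: str) -> bool:
    return any(re.search(mask, name) for mask in BINARY_MASKS)

print(is_binary_name("logo.png"))  # matched by mask -> treated as binary
print(is_binary_name("README"))    # no match -> treated as text
```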

But darcs isn't going to try to determine, from the file content, what 
encoding a file is using -- is that correct?

> In some cases you can determine the encoding, provided you define the
> encoding such that there are invalid files.  For example, a valid text file

An ASCII encoded file can also be a valid UTF-16 file, a valid UTF-8 file (in 
fact, all 7-bit ASCII files are also valid UTF-8 files), a valid 
UNILE-encoded file, and a large number of valid ISO-8859-* files.  Yes, there 
are a couple of characters which, if you're lucky enough that they occur in 
your file, can tell you that a file isn't an ASCII file, but you're asking 
for trouble if you try to auto-detect the encoding.  The best you'll be able 
to do is *rule out* some encodings, because they are defined with "magic" 
byte sequences.  UTF-16 and UNILE are good examples of this.  But that, 
generally, only narrows the field of possible *valid* encodings by a small 
percentage.
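
A quick demonstration of that point: one 7-bit ASCII byte string decodes without error under a whole range of encodings, so the content alone rules nothing in (Python sketch):

```python
# The same even-length ASCII bytes are "valid" under many encodings, so
# decoding rules nothing out here.  (ASCII byte pairs never form UTF-16
# surrogates, so even UTF-16LE accepts them -- as CJK garbage.)
data = b"plain ascii text"
for enc in ("ascii", "utf-8", "iso-8859-1", "iso-8859-15", "utf-16-le"):
    data.decode(enc)  # no UnicodeDecodeError for any of these
    print(enc, "accepts it")
```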

Worse, there are many multi-byte encodings which *can* include your "illegal" 
characters.  If you start accepting multi-byte encodings as "text" files 
(UTF-16, UNILE, Shift-JIS, UTF-8, etc.), it becomes impossible to tell the 
difference between binary data and a multi-byte encoded text file.
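
To put a rough number on it, here's a sketch (my own experiment, assuming UTF-16LE as the candidate "text" encoding): a large fraction of purely random byte strings decode as UTF-16 without error, because only unpaired surrogates are rejected:

```python
import random

# Sketch: most random even-length byte strings decode as UTF-16LE
# without error (only unpaired surrogates are rejected), so "decodes
# as UTF-16" is nearly useless as a text-vs-binary test.
random.seed(0)
valid = 0
for _ in range(1000):
    blob = bytes(random.randrange(256) for _ in range(64))
    try:
        blob.decode("utf-16-le")
        valid += 1
    except UnicodeDecodeError:
        pass
print(valid, "of 1000 random 64-byte blobs are 'valid' UTF-16LE")
```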

-- 
### SER   
### Deutsch|Esperanto|Francaise|Linux|XML|Java|Ruby|Aikido|Dirigibles
### http://www.germane-software.com/~ser  jabber.com:ser  ICQ:83578737 
### GPG: http://www.germane-software.com/~ser/Security/ser_public.gpg




