[darcs-users] Default binary masks
Sean E. Russell
ser at germane-software.com
Mon Nov 24 02:48:30 UTC 2003
-----BEGIN PGP SIGNED MESSAGE-----
On Sunday 23 November 2003 20:04, David Roundy wrote:
> > But are you suggesting automatically marking UTF-16 files as binary
> > files?
> No, that wouldn't be helpful. I'm suggesting that (post 1.0) if there is
> demand, we could create a patch type describing a change to a UTF-16 text
> file (or perhaps a container patch that converted a text-patch into a
> UTF-16 patch).
If you support multibyte encodings, you no longer have the ability to
automagically determine binaryness. I'm not even sure whether most
non-European encodings give the bytes 0x00 and 0x1a the same meaning, so
you may have trouble with some single-byte encodings, as well.
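
To make the concern concrete, here is a minimal Python sketch of a content-based binary check. It is not darcs's actual code: the name `looks_binary` and the 4096-byte sample size are assumptions for illustration, and the only rule it applies is "the bytes 0x00 or 0x1a mean binary", taken from the two bytes mentioned above.

```python
def looks_binary(data: bytes, sample_size: int = 4096) -> bool:
    """Hypothetical heuristic: call the data binary if the leading
    sample contains a NUL (0x00) or Ctrl-Z (0x1a) byte."""
    sample = data[:sample_size]
    return b"\x00" in sample or b"\x1a" in sample


text = "hello, world\n"

# The same text, stored in two encodings, gets two different verdicts,
# because UTF-16 pads every ASCII character with a zero byte.
print("utf-8 :", looks_binary(text.encode("utf-8")))
print("utf-16:", looks_binary(text.encode("utf-16")))
```

Under this assumed rule, a perfectly ordinary text file stored in a 16-bit encoding is classified as binary, which is the conflict described above.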
> UTF-8 files are perfectly fine text files requiring no extra work to be
> supported (except that darcs replace won't work quite as expected on UTF-8
UCS-2 and UCS-4 (both Unicode encodings) contain \0 as the first byte of the
file, and would be marked as binary by darcs. darcs is, as you say, safe
> We'd also want to be careful never to add a single character in the middle
> of the file...
I'm more concerned about how you're going to tell the difference between a
binary file and a text file, and how much it matters.
> This is no problem. In darcs it isn't the files that have a type, the
> patches have a type (binary or text). Admittedly, the patch type is
> usually determined by the file name...
... but also sometimes by magic byte sequences.
> > As an aside, 0xFFFE and 0xFEFF are legal characters for 8-bit ASCII files
> > to start with, although I'd guess those files are pretty rare.
> I thought ASCII was 7 bit?
There's 7-bit ASCII, and 8-bit ASCII. 8-bit ASCII is known as Extended ASCII,
but it is the encoding that is supported by your Linux box and these emails.
Actually, the common encoding in the USA is ISO-8859-1, which most people
equate with ASCII. Most Europeans use one of the ISO-8859-* encodings, and
make liberal use of
the extended characters. ASCII isn't really the issue, and I've been wrong
to throw it around as much as I have. What we're really talking about -- the
common case for darcs under Linux -- is one of the ISO-8859-* encodings, all
of which are 8-bit, and most (all?) of which share the same encodings for the
characters in the first 7 bits.
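
A quick way to check the "same first 7 bits" claim is to decode the same bytes under several ISO-8859-* codecs and compare the results. This is only a sketch: the helper name `same_decoding` and the choice of three encodings are assumptions, and it says nothing about the ISO-8859 parts not listed.

```python
ENCODINGS = ["iso-8859-1", "iso-8859-2", "iso-8859-15"]


def same_decoding(data: bytes, encodings=ENCODINGS) -> bool:
    """Return True if every listed encoding decodes `data` to the same text."""
    decoded = {data.decode(enc) for enc in encodings}
    return len(decoded) == 1


seven_bit = bytes(range(0x80))   # every byte value with the high bit clear
high_byte = b"\xa1"              # one byte from the extended range

print("7-bit range agrees:", same_decoding(seven_bit))
print("0xA1 agrees:       ", same_decoding(high_byte))

# 0xFF 0xFE is also ordinary text in ISO-8859-1, not only a byte-order mark.
print(repr(b"\xff\xfe".decode("iso-8859-1")))
```

The listed encodings agree on the 7-bit range and diverge above it, so a byte such as 0xA1 is a different letter depending on which part of ISO-8859 the author had in mind.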
> > But darcs isn't going to try to determine, from the file content, what
> > encoding a file is using -- is that correct?
> I don't see any reason not to try, as long as it can be overridden. In
> fact, darcs already tries to determine if a file is binary by looking at
> its content.
Yeah. The point I'm trying to get across is that I don't believe you can do
both. If you assume you're working with text patches, you can make a
reasonable guess at the encoding -- although this is not reliable. If you
ignore encodings (well, assume a single 8-bit encoding) you can make a
reasonable guess about whether a file is binary or text. But you can't reliably
determine that a file is text AND determine its encoding.
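
The ambiguity can be shown with a short Python sketch that simply tries to decode the same bytes under several encodings. The function name `plausible_encodings` and the candidate list are assumptions for illustration; a real guesser would need statistics on top of this, and would still be guessing.

```python
CANDIDATES = ["utf-8", "iso-8859-1", "shift_jis", "euc_jp"]


def plausible_encodings(data: bytes, candidates=CANDIDATES) -> list:
    """Return the candidate encodings under which `data` decodes without error."""
    plausible = []
    for encoding in candidates:
        try:
            data.decode(encoding)
        except UnicodeDecodeError:
            continue  # these bytes are not valid in this encoding
        plausible.append(encoding)
    return plausible


# Two bytes that are one character in UTF-8 and two characters in ISO-8859-1.
sample = "\u00e9".encode("utf-8")
print(plausible_encodings(sample))
```

Decoding without error only rules encodings out. ISO-8859-1 accepts every possible byte sequence, so it can never be ruled out by this test, and more than one candidate usually survives.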
> to try to guess what the encoding is. As far as I can see, the only three
> possibilities are 8 bit text (which includes UTF-8, since it has the same 8
UTF-8 maps 1:1 to 7-bit ASCII, not to 8-bit ASCII or to ISO-8859-*. After the
first 128 characters, the rest of the 2^31 characters are multi-byte
> bit newlines), UTF-16 and binary. I'm not likely to bother encoding a
> UTF-16 guesser, since I have no UTF-16 files, but I don't see that it would
> be a bad idea. Except that I don't really like introducing new patch
If you're going to support UTF-16, you'd better worry about Shift-JIS and EUC.
Both are much more common in Japan than either ISO-8859-1 or UTF-16 -- or
UTF-8, for that matter. Unless, of course, you don't care about darcs in
Detecting encodings is more difficult than it sounds, which is why HTML and
XML are careful to include encoding metadata in the header.
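
As an illustration of what that metadata buys you, here is a minimal Python sketch that reads the `encoding` attribute from an XML declaration instead of guessing. The function name `declared_xml_encoding` is hypothetical, and the sketch assumes the file starts in an ASCII-compatible encoding; it does not handle byte-order marks or 16-bit encodings.

```python
import re

# Matches the start of an XML declaration and captures the encoding name.
XML_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def declared_xml_encoding(head: bytes):
    """Return the encoding named in the XML declaration, or None if absent."""
    match = XML_DECL.match(head)
    return match.group(1).decode("ascii") if match else None


print(declared_xml_encoding(b'<?xml version="1.0" encoding="Shift_JIS"?><doc/>'))
print(declared_xml_encoding(b'<?xml version="1.0"?><doc/>'))
```

A plain source file has no such declaration to read, which is why any encoding detection for it has to fall back on guesses or on an explicit override.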
### http://www.germane-software.com/~ser jabber.com:ser ICQ:83578737
### GPG: http://www.germane-software.com/~ser/Security/ser_public.gpg
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.1 (GNU/Linux)
-----END PGP SIGNATURE-----