[darcs-users] Default binary masks

Sean E. Russell ser at germane-software.com
Mon Nov 24 02:48:30 UTC 2003


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Sunday 23 November 2003 20:04, David Roundy wrote:
> > But are you suggesting automatically marking UTF-16 files as binary
> > files?
>
> No, that wouldn't be helpful.  I'm suggesting that (post 1.0) if there is
> demand, we could create a patch type describing a change to a UTF-16 text
> file (or perhaps a container patch that converted a text-patch into a
> UTF-16 patch).

If you support multibyte encodings, you no longer have the ability to 
automagically determine binaryness.  I'm not even sure whether most 
non-European encodings have the same meaning for the bytes 0x00 and 0x1a, so 
you may have trouble with some single-byte encodings, as well.

> UTF-8 files are perfectly fine text files requiring no extra work to be
> supported (except that darcs replace won't work quite as expected on UTF-8

UCS-2 and UCS-4 (both Unicode encodings) encode ASCII-range characters with \0 
bytes, often as the first byte of the file, so such files would be marked as 
binary by darcs.  darcs is, as you say, safe from UTF-8.
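
To make the point above concrete, here is a minimal sketch.  It assumes the 
binary check amounts to "the file contains a 0x00 or 0x1a byte"; the function 
name looks_binary is hypothetical and is not darcs's actual code.  Big-endian 
encodings without a byte-order mark are used so that the leading \0 is visible.

```python
def looks_binary(data: bytes) -> bool:
    # Assumed heuristic (not darcs's real implementation): a NUL byte (0x00)
    # or a Ctrl-Z byte (0x1a) anywhere in the data means "binary".
    return b"\x00" in data or b"\x1a" in data


text = "hello\n"

utf8 = text.encode("utf-8")        # one byte per ASCII character, no NULs
utf16 = text.encode("utf-16-be")   # two bytes per character, high byte is 0x00
utf32 = text.encode("utf-32-be")   # four bytes per character, three are 0x00

for name, data in [("UTF-8", utf8), ("UTF-16BE", utf16), ("UTF-32BE", utf32)]:
    print(name, data, "binary" if looks_binary(data) else "text")
```

Under that assumption the UTF-8 bytes pass as text, while the same text in 
either wide encoding is flagged as binary.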

> We'd also want to be careful never to add a single character in the middle
> of the file...

I'm more concerned about how you're going to tell the difference between a 
binary file and a text file, and how much it matters.

> This is no problem.  In darcs it isn't the files that have a type, the
> patches have a type (binary or text).  Admittedly, the patch type is
> usually determined by the file name...

... but also sometimes by magic byte sequences.

> > As an aside, 0xFFFE and 0xFEFF are legal characters for 8-bit ASCII files
> > to start with, although I'd guess those files are pretty rare.
>
> I thought ASCII was 7 bit?

There's 7-bit ASCII, and 8-bit ASCII.  8-bit ASCII is known as Extended ASCII, 
but it is the encoding that is supported by your Linux box and these emails.  
Actually, the common encoding in the USA is ISO-8859-1, which most people 
equate with ASCII.  Most Europeans use an ISO-8859-*, and make liberal use of 
the extended characters.  ASCII isn't really the issue, and I've been wrong 
to throw it around as much as I have.  What we're really talking about -- the 
common case for darcs under Linux -- is one of the ISO-8859-* encodings, all 
of which are 8-bit, and most (all?) of which share the same encodings for the 
characters in the first 7 bits.

> > But darcs isn't going to try to determine, from the file content, what
> > encoding a file is using -- is that correct?
>
> I don't see any reason not to try, as long as it can be overridden.  In
> fact, darcs already tries to determine if a file is binary by looking at
> its content.

Yeah.  The point I'm trying to get across is that I don't believe you can do 
both.  If you assume you're working with text patches, you can make a 
reasonable guess at the encoding -- although this is not reliable.  If you 
ignore encodings (well, assume a single 8-bit encoding) you can make a 
reasonable guess about whether a file is binary or text.  But you can't 
reliably determine that a file is text AND determine its encoding.
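
A small illustration of why the encoding guess is unreliable: the same bytes 
can be valid in more than one encoding.  This is only a sketch using Python's 
standard codecs, and the sample string is arbitrary.

```python
# "cafe" with an acute accent on the e, encoded as UTF-8.
data = "caf\u00e9".encode("utf-8")       # b'caf\xc3\xa9'

# Both decodes succeed, but they disagree about what the text says.
as_utf8 = data.decode("utf-8")           # 4 characters
as_latin1 = data.decode("iso-8859-1")    # 5 characters

print(repr(as_utf8), len(as_utf8))
print(repr(as_latin1), len(as_latin1))

# ISO-8859-1 assigns a character to every byte value, so "it decodes
# without error" says nothing about whether the data is really text.
every_byte = bytes(range(256))
print(len(every_byte.decode("iso-8859-1")))
```

Since every possible byte sequence decodes under ISO-8859-1, a successful 
decode cannot tell text from binary, and it cannot rule out UTF-8 either.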

> to try to guess what the encoding is.  As far as I can see, the only three
> possibilities are 8 bit text (which includes UTF-8, since it has the same 8

UTF-8 maps 1:1 to 7-bit ASCII, not to 8-bit ASCII or to ISO-8859-*.  After the 
first 128 characters (0-127), the rest of the 2^31 characters are multi-byte 
sequences.
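
The difference shows up as soon as a character falls outside 7-bit ASCII.  
A minimal sketch with Python's standard codecs (the sample characters are 
arbitrary):

```python
plain = "A"            # inside 7-bit ASCII
accented = "\u00e9"    # e with acute accent, outside 7-bit ASCII

# Inside the ASCII range all three encodings produce the same single byte.
same = (plain.encode("ascii")
        == plain.encode("iso-8859-1")
        == plain.encode("utf-8"))
print(same)

# Outside it, ISO-8859-1 still uses one byte but UTF-8 needs two.
latin1_bytes = accented.encode("iso-8859-1")   # b'\xe9'
utf8_bytes = accented.encode("utf-8")          # b'\xc3\xa9'
print(latin1_bytes, utf8_bytes)

# The single ISO-8859-1 byte is not valid UTF-8 on its own.
try:
    latin1_bytes.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
print(valid_utf8)
```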

> bit newlines), UTF-16 and binary.  I'm not likely to bother encoding a
> UTF-16 guesser, since I have no UTF-16 files, but I don't see that it would
> be a bad idea.  Except that I don't really like introducing new patch

If you're going to support UTF-16, you'd better worry about Shift-JIS and EUC.  
Both are much more common in Japan than either ISO-8859-1 or UTF-16 -- or 
UTF-8, for that matter.  Unless, of course, you don't care about darcs in 
Japan :-)

Detecting encodings is more difficult than it sounds, which is why HTML and 
XML are careful to include encoding metadata in the header.

- -- 
### SER   
### Deutsch|Esperanto|Francaise|Linux|XML|Java|Ruby|Aikido|Dirigibles
### http://www.germane-software.com/~ser  jabber.com:ser  ICQ:83578737 
### GPG: http://www.germane-software.com/~ser/Security/ser_public.gpg
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.1 (GNU/Linux)

iD8DBQE/wXGEP0KxygnleI8RAjI7AJ9Rxceci4vQrbl8xQMGy1GgPJwhsACdGGWi
REK3JW+EkjfJC3XRCxjlzyg=
=KCLZ
-----END PGP SIGNATURE-----




