[darcs-users] petition for '\0' to be removed from binary auto-detection code

Ralph Corderoy ralph at inputplus.co.uk
Tue Nov 16 13:29:38 UTC 2004


Hi David,

> On Mon, Nov 15, 2004 at 06:06:21PM +0000, Mark Stosberg wrote:
> > While I like the idea of auto-detecting binary files, I realized
> > that '\0' (aka NUL) is not a good test. 
> > 
> > It is sometimes used (at least) in Perl to put a bunch of things
> > into a string that you may want to separate back out later. The
> > character is used precisely because it doesn't occur in text.
> > 
> > In particular, it's still used in the modern "CGI.pm" library, to
> > provide compatibility with the ancient 'cgi-lib.pl' library. 

That doesn't mean they have to have an ASCII NUL byte in their source,
e.g. "\x00" does just as well.
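
To illustrate the point, here is a minimal sketch in Python rather than
Perl (the variable names are invented for the example). The source text
contains only printable ASCII, yet the string it builds at run time
holds a NUL.

```python
# Source text as it would sit on disk: the escape is four printable
# characters (backslash, x, 0, 0), not a NUL byte.
source = r'joined = "\x00".join(["a", "b"])'

# A NUL-based binary check on the source file finds nothing.
assert b"\x00" not in source.encode("ascii")

# Running the source still produces a string containing NUL.
namespace = {}
exec(source, namespace)
assert namespace["joined"] == "a\x00b"
```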

> Sounds like a reasonable argument to me.  The only trouble is that
> this pretty well guts the check for binary files, since we currently
> only check for '\0' and '\26' (EOF).  And I imagine that it is usually
> the '\0' check that correctly identifies binary files.

Yes.
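
A rough sketch of the check as David describes it, in Python rather than
darcs's Haskell.  The function name is invented, and it assumes that
'\26' means decimal 26, i.e. Ctrl-Z (0x1A), and that the whole content
is scanned; none of this is taken from the darcs source.

```python
NUL = 0x00
CTRL_Z = 0x1A  # decimal 26, the DOS end-of-file marker (assumed meaning of '\26')


def looks_binary(data: bytes) -> bool:
    """Return True if data contains a NUL or Ctrl-Z byte."""
    return NUL in data or CTRL_Z in data
```

With this, looks_binary(b"hello\n") is false, while any content holding
a single NUL, such as b"a\x00b", is flagged as binary.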

Personally, I dislike the check: it will be wrong some of the time, and
that causes confusion when it happens.  I'd prefer that the user always
has to specify.

Perl has a `text or binary' test with its -T and -B file operators.

    The "-T" and "-B" switches work as follows.  The first block or so
    of the file is examined for odd characters such as strange control
    codes or characters with the high bit set.  If too many strange
    characters (>30%) are found, it's a "-B" file, otherwise it's a "-T"
    file.  Also, any file containing null in the first block is
    considered a binary file.  If "-T" or "-B" is used on a filehandle,
    the current stdio buffer is examined rather than the first block.
    Both "-T" and "-B" return true on a null file, or a file at EOF when
    testing a filehandle.  Because you have to read a file to do the
    "-T" test, on most occasions you want to use a "-f" against the file
    first, as in "next unless -f $file && -T $file".

Note that `any file containing null in the first block is considered a
binary file' covers the CGI.pm file that started this thread.  :-)
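
A loose sketch of that heuristic in Python, following the documentation
text quoted above rather than Perl's actual source.  The 512-byte block
size and the exact set of "ordinary" characters are assumptions; the
documentation specifies neither.

```python
BLOCK_SIZE = 512  # assumed; the documentation only says "the first block or so"

# Assumed to count as ordinary text: printable ASCII plus common control codes.
ORDINARY = set(range(0x20, 0x7F)) | {0x08, 0x09, 0x0A, 0x0C, 0x0D, 0x1B}


def is_text(data: bytes) -> bool:
    """Approximate Perl's -T test on the first block of data."""
    block = data[:BLOCK_SIZE]
    if not block:
        return True  # -T and -B are both true for an empty file
    if 0 in block:
        return False  # any NUL in the first block means binary
    odd = sum(1 for byte in block if byte not in ORDINARY)
    return odd / len(block) <= 0.30


def is_binary(data: bytes) -> bool:
    """Approximate Perl's -B test: the opposite of -T, except on empty input."""
    return not data[:BLOCK_SIZE] or not is_text(data)
```

Under these assumptions a Perl source line with a literal NUL in it is
reported as binary, while the same line written with the "\x00" escape
is reported as text.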

> Another option would be to add a set of regexps that indicate files
> that are *always* text.  This would be an ugly option, but might be
> used to keep \0 as a binary test, but special-case .pl files out of
> getting checked.

Please, no!  Besides, wouldn't a tar file containing a text file be
detected as text with this?

Cheers,


Ralph.




