PerlMonks  

Re: Perl -T vs Mime::Types

by afoken (Chancellor)
on Sep 19, 2017 at 19:38 UTC [id://1199694]


in reply to Perl -T vs Mime::Types

-T tests only the first block or so of the file (see -X). File::Type just guesses a file type by searching for a few magic numbers, like file does. Neither can be reliable.

If you want to check for a file that contains only ASCII characters, you have to check the entire file. There is no other way.

I guess you also want to enforce a sane file size limit, perhaps a few hundred kBytes or a few MBytes. On a modern computer, slurping the entire file under such a limit is no big problem.

You may want something like this (untested):

    -f $filename or die "$filename is not a file";
    (-s _ < 100_000) or die "$filename is too large"; # avoid a second stat() syscall by using the special handle "_"
    my $blob=do {
        open my $f,'<:raw',$filename or die "Can't open $filename: $!";
        local $/; # slurp mode
        <$f>; # slurp
        # leaving the do block auto-closes $f
    };
    # Accept only CR, LF, TAB, and printable characters from 0x20 to 0x7E.
    $blob=~/^[\r\n\t\x20-\x7E]*$/s or die "$filename is not ASCII";

If you want significantly larger files, you have to read smaller blocks (perhaps 1 MByte each), and check each block for its "ASCIIness". Abort at the first failed block.
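
The block-wise variant could look roughly like the sketch below (also untested against real-world data). The helper name file_is_ascii and its optional block size argument are made up for illustration; the allowed character set is the same as in the slurping version, and tr///c is used only to count bytes outside that set.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of any module): returns 1 if every byte of
# the file is CR, LF, TAB, or printable ASCII (0x20..0x7E), 0 otherwise.
sub file_is_ascii {
    my ($filename, $block_size) = @_;
    $block_size ||= 1024 * 1024;    # assumed default: 1 MByte per read

    open my $fh, '<:raw', $filename
        or die "Can't open $filename: $!";

    while (1) {
        my $got = read($fh, my $block, $block_size);
        defined $got or die "Can't read $filename: $!";
        last if $got == 0;          # end of file reached

        # tr///c counts the bytes outside the allowed set without
        # modifying $block; abort at the first block that fails.
        return 0 if $block =~ tr/\r\n\t\x20-\x7E//c;
    }

    close $fh;
    return 1;
}
```

Usage would be something like file_is_ascii($filename) or die "$filename is not ASCII"; since each block is checked independently, memory use stays at roughly one block regardless of file size.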

Alexander

--
Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)

Re^2: Perl -T vs Mime::Types
by AnomalousMonk (Archbishop) on Sep 20, 2017 at 00:43 UTC

    tr/// may be a bit faster than a regex match, so maybe (also untested)
        $blob =~ tr/\r\n\t\x20-\x7E//c or die "$filename is not ASCII";
    (See perlop Quote-Like Operators for tr/// and its /c (complement) modifier.)

    Update: Correction: The logical operator should be "and", because we want an exception to be thrown if any "non-ASCII" character is found, i.e., if the tr///c count is non-zero:
        $blob =~ tr/\r\n\t\x20-\x7E//c and die "$filename is not ASCII";
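
To see why "and" is the right operator, it may help to look at what tr///c actually returns. A small sketch, with a made-up helper name (count_non_ascii) and made-up sample strings:

```perl
use strict;
use warnings;

# Hypothetical helper for illustration: number of bytes outside the
# allowed set (CR, LF, TAB, printable 0x20..0x7E).
sub count_non_ascii {
    my ($blob) = @_;
    # /c complements the search list; with an empty replacement list and
    # no /d modifier, tr only counts and leaves the string unchanged.
    return ($blob =~ tr/\r\n\t\x20-\x7E//c);
}

for my $sample ("plain text\r\n", "caf\xE9", "") {
    my $bad = count_non_ascii($sample);
    # A non-zero count is true, so "tr///c and die" fires only on bad input.
    print $bad ? "rejected ($bad bad)\n" : "accepted\n";
}
```

A clean string yields a count of 0, which is false, so with "or" the die would fire on exactly the inputs that should pass.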


    Give a man a fish:  <%-{-{-{-<
