PerlMonks  

Find illegal ASCII characters

by fmogavero (Monk)
on Mar 07, 2002 at 17:50 UTC (#150070=sourcecode)
Category: Utilities
Author/Contact Info fmogavero
fmogo@mninter.net
Description: This script will read a file byte by byte and send messages to the screen to signal bytes that are out of the ASCII text range. A client sent us a file with bad data and it totally hosed their data in the database.
use strict;
use warnings;

my $file = $ARGV[0];
die "usage: $0 file\n" unless defined $file;

open(my $input, '<', $file) || die "can't open $file: $!\n";
binmode $input;      # read raw bytes, no newline translation

my $filesize = -s $input;
print "File size is $filesize bytes.\n";

my $position = 0;    # byte offset of the byte being examined
my $line     = 1;    # number of the line being examined
my $byte;

# read() returns 0 at end of file, so the last byte is checked too
# and no seek is needed between reads.
while (read($input, $byte, 1)) {
    my $val = ord $byte;

    if ( ($val < 32 && $val != 10 || $val > 126) ) {
        print "Line $line byte value $val at offset $position is out of ASCII text range!\n";
    }

    $line++ if $val == 10;    # every LF starts a new line
    $position++;
}

close $input;

print "$line lines in file!\n";
Replies are listed 'Best First'.
•Re: Find illegal ASCII characters
by merlyn (Sage) on Mar 07, 2002 at 19:18 UTC
    #!/usr/bin/perl
    undef $/;
    while (<>) {
        print "File $ARGV has ", length(), " total length\n";
        while (/([^\n\r\x20-\x7f])/g) {
            print "File $ARGV has character ", ord($1), " at byte ", pos()-1, "\n";
        }
        print "File $ARGV has ", tr/\n//, " total lines\n";
    }

    -- Randal L. Schwartz, Perl hacker

       while (/([^\n\r\x20-\x7f])/g) {

      Why not compile the expression only once (the /o modifier) to make the script faster?

        while (/([^\n\r\x20-\x7f])/go) {

      Saluti.
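For context on the /o suggestion: as far as perlop describes it, /o only affects patterns that interpolate variables, so it should make no difference for a constant pattern like the one above, which Perl compiles once anyway. When a pattern does need to be built once and reused, qr// is the usual tool. A minimal sketch, assuming a hypothetical helper named find_bad_bytes (not part of either script in this thread):

```perl
use strict;
use warnings;

# Precompiled pattern: any byte that is not LF, CR, or in \x20-\x7f,
# the same character class used in the reply above.
my $bad = qr/[^\n\r\x20-\x7f]/;

# Hypothetical helper: returns a list of [offset, byte value] pairs,
# one per offending byte in $data.
sub find_bad_bytes {
    my ($data) = @_;
    my @hits;
    while ($data =~ /($bad)/g) {
        push @hits, [ pos($data) - 1, ord $1 ];
    }
    return @hits;
}

my @hits = find_bad_bytes("ab\x01c\n\xffd");
printf "byte %d at offset %d\n", $_->[1], $_->[0] for @hits;
```

Here the sample string has a 0x01 byte at offset 2 and a 0xff byte at offset 5, and those are the two hits reported.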

Re: Find illegal ASCII characters
by ww (Archbishop) on Mar 11, 2005 at 20:45 UTC
    Being nitpicky, but line 38 in original (quoted below) is NOT a general test for chars in 'ASCII text range':
    if ( ($val < 32 && $val != 10 || $val > 126) ) {
    as this misses some ASCII chars, such as 0x0d, that ARE in 'ASCII text range' at least as much as is 0x0a (10d).

    Applicability of this nitpick depends on circumstances. Windows, for example, relies on the pair 0x0d 0x0a (CR LF) as its line ending.
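A looser range test along the lines of this reply might look like the sketch below. The helper name is_ascii_text is hypothetical, and accepting TAB (0x09) alongside LF and CR is an assumption added here; the reply itself only names 0x0d.

```perl
use strict;
use warnings;

# Hypothetical helper: true if byte value $val is printable ASCII (32..126)
# or one of the whitespace controls TAB (9), LF (10), CR (13).
# Including TAB is an assumption; adjust the list to suit the data.
sub is_ascii_text {
    my ($val) = @_;
    return 1 if $val >= 32 && $val <= 126;
    return 1 if $val == 9 || $val == 10 || $val == 13;
    return 0;
}

for my $val (9, 10, 13, 0, 127, 200) {
    printf "%3d: %s\n", $val, is_ascii_text($val) ? "text" : "out of range";
}
```

Whether CR and TAB should pass depends on where the file came from; a stricter check may be right for a Unix-only feed.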

Re: Find illegal ASCII characters
by fmogavero (Monk) on Mar 07, 2002 at 20:14 UTC