PerlMonks  

Rogue Null (ordinate 0) characters in text files

by paulnovl (Novice)
on May 07, 2008 at 23:30 UTC ( [id://685354]=perlquestion )

paulnovl has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

The following code is supposed to check for "illegal" characters in a text file that I pass to it.
I define illegal as ordinates 0-8, 11-31 or >126.

    BEGIN{
        @ARGV == 1 or warn "\n\tUsage: $0 FILE\n\n" and exit 255;
        $in=shift;
        warn "Error: Can't read input file $in\n" and exit 255 if ! -s $in;
    }
    open (INFILE, $in) or die "Can't open $in for reading\n";
    while (<INFILE>) {
        $line = $_;
        chomp $line;
        @character=split /\.*/;
        while (@character){
            my $character = shift @character;
            $ord = ord($character);
            if ($ord>126) {
                print "$in contains illegal character \(ord:\ ", "$ord\) on line $.";
            }
            elsif ($ord<32) {
                if ($ord<9) {
                    print "$in contains illegal character \(ord:\ ", "$ord\) on line $.";
                }
                if ($ord>10) {
                    print "$in contains illegal character \(ord:\ ", "$ord\) on line $.";
                }
            }
        }
    }
    close (INFILE);

That gives the expected results for most of my text files, e.g. a text file that contains ^Z on line 15 will yield:

test.txt contains illegal character (ord: 26) on line 15

However if the input text file contains a line commencing with "." (period), I get output like this:

test.txt contains illegal character (ord: 0) on line 23

Can anyone explain (in terms a newbie can understand) why it thinks a "." at the start of a line is a Null character?

Thanks.

Replies are listed 'Best First'.
Re: Rogue Null (ordinate 0) characters in text files
by ikegami (Patriarch) on May 07, 2008 at 23:56 UTC

    split /\.*/ is wrong. (So is split /.*/, which is probably what you intended.) You don't have a list of items separated by zero or more periods. You want split //.

    split /\.*/ results in you only checking the first character of every line and the ones after periods, and it results in you doing ord('') (which returns zero) when the line starts with a period.

    There are other issues, though.

    • You chomp $line, but you split $_. Why do you have $line at all?
    • You really should use use strict; and use warnings;. As it is, you'll find some variables you forgot to scope using my.
    • while (@character){my $character = shift @character;
      can be written as
      for my $character (@character)
    • @character is singular even though it holds many characters.
    • There's no reason to use BEGIN where you did.
    • The -s check is redundant with open.
    • warn and exit 255 can be replaced with die unless you specifically want code 255.
    • It helps to include the reason ($!) the open failed.
    • Your complex if can be collapsed.
        use strict;
        use warnings;

        @ARGV == 1
            or die("Usage: $0 FILE\n");

        my $in = shift;

        open(INFILE, '<', $in)
            or die("Can't open $in for reading: $!\n");

        while (<INFILE>) {
            chomp;
            for my $char (split //) {
                my $ord = ord($char);
                if ( $ord < 9 || ($ord > 10 && $ord < 32) || $ord > 126 ) {
                    print("$in contains illegal character \(ord:\ ",
                          "$ord\) on line $.\n");
                }
            }
        }

    It's usually better to use <> instead of opening the file yourself. This allows input to be read from STDIN if no filename is provided.

        use strict;
        use warnings;

        while (<>) {
            chomp;
            for my $char (split //) {
                my $ord = ord($char);
                if ( $ord < 9 || ($ord > 10 && $ord < 32) || $ord > 126 ) {
                    print("Input contains illegal character \(ord:\ ",
                          "$ord\) on line $.\n");
                }
            }
        }

    Furthermore, you could use a regexp instead of splitting.

        use strict;
        use warnings;

        while (<>) {
            chomp;
            # Capture the offending character so its ordinal can be reported.
            while (/([^\x09\x20-\x7E])/g) {
                my $ord = ord($1);
                print("Input contains illegal character \(ord:\ ",
                      "$ord\) on line $.\n");
            }
        }

      > split /\.*/ results in you only checking the first character of every line and the ones after periods

      What you describe would happen with /\.+/. As long as there are no periods, /\.*/ does split up individual characters just like //. (Not saying that it should be used here, though...)

        > as long as there are no periods, /\.*/ does split up individual characters just like //.

        No, it doesn't.

            @c = split /\.+/, "abcdefg";
            @d = split /\.+/, "ab.c.....defg";

        results in @c having one element, "abcdefg", and @d having 3 elements, "ab", "c", and "defg".
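
The two patterns in this subthread are easy to compare side by side. The following is a minimal sketch, not taken from any of the posts: field_counts is a helper name invented for illustration, and it simply counts the fields that /\.+/, /\.*/ and // each produce for the two strings used above.

```perl
use strict;
use warnings;

# Hypothetical helper: count the fields each pattern produces for $text.
sub field_counts {
    my ($text) = @_;
    my @plus  = split /\.+/, $text;   # separator: one or more periods
    my @star  = split /\.*/, $text;   # separator: zero or more periods
    my @empty = split //,    $text;   # separator: the empty string
    return (scalar @plus, scalar @star, scalar @empty);
}

for my $text ('abcdefg', 'ab.c.....defg') {
    printf "%-13s  /\\.+/ => %d  /\\.*/ => %d  // => %d\n",
        $text, field_counts($text);
}
```

For "abcdefg" this reports 1, 7 and 7 fields; for "ab.c.....defg" it reports 3, 7 and 13. In other words, /\.*/ does break a period-free string into single characters, and where periods occur it swallows them as separators, whereas // keeps them as fields.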
Re: Rogue Null (ordinate 0) characters in text files
by tachyon-II (Chaplain) on May 08, 2008 at 00:39 UTC

    Unless you really want to print every time you find a rogue char you can just use tr:

        $str = join '', map{chr}0..255;
        $str =~ tr/\11\12\40-\176//cd;  # ascify aka remove all crap chars
        print $str, $/;

    The /c complements the list of desired characters; the /d deletes the characters that don't match. The good printable ASCII characters are specified in octal, but you can use hex. To do what you are doing with a regex (here the chars are expressed as hex just to show TIMTOWTDI):

        while (<DATA>) {
            while ( m/([^\x09\x0A\x20-\x7E])/g ) {
                printf "Found chr %d on line %d at pos %d\n", ord($1), $., pos($_);
            }
        }
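
If only a count is needed rather than the characters themselves, tr can also count without changing the string. This is a minimal sketch assuming the same legal set as in the post above (tab, newline, and 0x20 through 0x7E); count_illegal and the sample string are invented for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: return how many characters of $text fall outside
# the legal set. With /c, an empty replacement list and no /d, tr only
# counts the complemented characters and leaves the string unchanged.
sub count_illegal {
    my ($text) = @_;
    return ($text =~ tr/\x09\x0A\x20-\x7E//c);
}

my $sample = "ok line\n\x00bad\x1A line\n";   # made-up input with a NUL and a ^Z
print count_illegal($sample), " illegal character(s)\n";
```

A nonzero return value could then decide whether it is worth running the slower per-character report at all.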
Re: Rogue Null (ordinate 0) characters in text files
by pc88mxer (Vicar) on May 07, 2008 at 23:37 UTC
    I would just use a regular expression:
        # \x7f rather than \x7e: ordinal 126 (~) is legal, only >126 is not
        if ($line =~ m/([\0-\x08\x0b-\x1f\x7f-\xff])/) {
            print "$line contains illegal character (ord: ", ord($1), ")\n";
        }
    The reason why your code isn't working is because instead of:
        @character=split /\.*/;
        while (@character){
    you really want:
        @character=split //; # split $_ into single characters
        while (@character){
    With the original split call, when there's a dot at the beginning of the line the first element of @character is the empty string.
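
The empty first element is easy to see in isolation. This is a minimal sketch using a made-up sample line; the 0 at the front of the ordinals is what turns into the spurious "ord: 0" report.

```perl
use strict;
use warnings;

my $line = ".abc";   # made-up sample: a line that starts with a period

# The period at position 0 is matched as a separator, so split returns
# an empty leading field followed by the remaining characters.
my @fields = split /\.*/, $line;
my @ords   = map { ord($_) } @fields;   # ord('') is 0

print scalar(@fields), " fields, ords: @ords\n";
```

This prints "4 fields, ords: 0 97 98 99". Swapping the pattern for // yields four single-character fields instead, with 46 (the period itself) as the first ordinal.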
      Thanks to you all for those very useful replies.

      You've taught me a lot.
