PerlMonks  

Weird Character in File Makes Perl Think it's EOF

by davemabe (Monk)
on Oct 15, 2008 at 21:01 UTC

davemabe has asked for the wisdom of the Perl Monks concerning the following question:

I've got some files that I'm trying to parse and there's a \032 character (ASCII 26 (EOF)) in there that is making Perl think that it's the end of the file. How can I strip it out? Everything I try just stops there since it thinks it's the end of the file. I thought for sure there was a special variable that controlled EOF (like $/ for end of line) but I can't seem to find it.

Re: Weird Character in File Makes Perl Think it's EOF
by Fletch (Bishop) on Oct 15, 2008 at 21:07 UTC

    Sounds like someone's on win32 and trying to read a binary file in text mode. A judicious application of binmode is probably called for.

    The cake is a lie.
    The cake is a lie.
    The cake is a lie.

      Yes, we had the exact same problem where I work, and binmode solved it.
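    A minimal sketch of the binmode approach, assuming a throwaway sample file built with File::Temp. The helper name slurp_raw is hypothetical and does not come from the thread. It reads the file as raw bytes, so byte 26 is treated as ordinary data rather than as an end-of-file marker:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: read a whole file as raw bytes.
sub slurp_raw {
    my ($path) = @_;
    open(my $in, '<', $path) or die "Cannot open $path: $!";
    binmode($in);    # no CRLF translation, no special meaning for byte 26
    local $/;        # slurp mode: read up to the real end of the file
    my $bytes = <$in>;
    close($in);
    return $bytes;
}

# Build a small sample file that contains byte 26 (\x1A) in the middle.
my ($tmp, $sample) = tempfile(UNLINK => 1);
binmode($tmp);
print {$tmp} "abc\r\n\x1Adef\r\n";
close($tmp) or die "Cannot close $sample: $!";

my $data = slurp_raw($sample);
printf "read %d bytes, first \\x1A at offset %d\n",
    length($data), index($data, "\x1A");
```

    If the reported length matches the size of the file on disk, the read did not stop at the \x1A byte.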

Re: Weird Character in File Makes Perl Think it's EOF
by wol (Hermit) on Oct 16, 2008 at 12:14 UTC
    It sounds like binmode would enable you to read past byte value 26, but then you lose the handling of the platform dependent bytes indicating end-of-line.

    This is only an issue if your file is supposed to be treated as a text file with newlines in it.

    The existence of a byte that doesn't usually appear in text files suggests that you're processing a binary file anyway, but can you confirm whether this is the case?

    --
    .sig : File not found.
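    If the file really is text, one way to keep binmode and still cope with platform line endings is to strip them by hand. This is only a sketch: the helper name read_lines_raw is hypothetical, and it assumes that lines end in either LF or CRLF.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: read lines from a file in binary mode and remove
# the line ending ourselves, accepting both CRLF and bare LF.
sub read_lines_raw {
    my ($path) = @_;
    open(my $in, '<', $path) or die "Cannot open $path: $!";
    binmode($in);
    my @lines;
    while (my $line = <$in>) {
        $line =~ s/\r?\n\z//;    # chomp alone would leave the \r behind
        push @lines, $line;
    }
    close($in);
    return @lines;
}

# Sample file with mixed line endings and a \x1A byte on the second line.
my ($tmp, $sample) = tempfile(UNLINK => 1);
binmode($tmp);
print {$tmp} "abc\r\n\x1Adef\r\nlast\n";
close($tmp) or die "Cannot close $sample: $!";

my @lines = read_lines_raw($sample);
printf "%d lines read\n", scalar @lines;
```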

Re: Weird Character in File Makes Perl Think it's EOF
by CountZero (Bishop) on Oct 16, 2008 at 08:06 UTC
    I've got some files that I'm trying to parse

    Can you show us your code for reading these files? As there are many ways to read a file, it is impossible to help you debug your program if we do not know what is in your program.

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

Re: Weird Character in File Makes Perl Think it's EOF
by ikegami (Patriarch) on Oct 19, 2008 at 21:13 UTC

    Perl hasn't behaved that way for 6.5 years.

    >perl -e"print qq{abc\n\cZdef\n}" | c:\progs\perl560\bin\perl -e"while (<>) { chomp; print qq{[$_]\n} }"
    [abc]

    >perl -e"print qq{abc\n\cZdef\n}" | c:\progs\perl561\bin\perl -e"while (<>) { chomp; print qq{[$_]\n} }"
    [abc]

    >perl -e"print qq{abc\n\cZdef\n}" | c:\progs\perl580\bin\perl -e"while (<>) { chomp; print qq{[$_]\n} }"
    [abc]
    [→def]

    >perl -e"print qq{abc\n\cZdef\n}" | c:\progs\perl588\bin\perl -e"while (<>) { chomp; print qq{[$_]\n} }"
    [abc]
    [→def]

    >perl -e"print qq{abc\n\cZdef\n}" | c:\progs\perl5100\bin\perl -e"while (<>) { chomp; print qq{[$_]\n} }"
    [abc]
    [→def]

    Time to upgrade!

    5.8.0 is when PerlIO started being used (by ActiveState, at least). Under PerlIO, files are read in "as binary", then they are "converted to text" by the crlf layer if present. The crlf layer doesn't treat chr(26) specially like the old library did.

    Note: → represents character 26.
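    The layer behaviour described above can be checked by naming the layers explicitly in open. This is a sketch that assumes a PerlIO-enabled perl (5.8 or later); the helper name read_with_layer and the sample data are made up for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: read every line of a file through the given PerlIO layer.
sub read_with_layer {
    my ($path, $layer) = @_;
    open(my $in, "<$layer", $path) or die "Cannot open $path: $!";
    my @lines = <$in>;
    close($in);
    return @lines;
}

# Sample file with CRLF line endings and a chr(26) at the start of line 2.
my ($tmp, $sample) = tempfile(UNLINK => 1);
binmode($tmp);
print {$tmp} "abc\r\n\x1Adef\r\n";
close($tmp) or die "Cannot close $sample: $!";

my @text = read_with_layer($sample, ':crlf');    # CRLF becomes LF, chr(26) is kept
my @raw  = read_with_layer($sample, ':raw');     # bytes exactly as stored

printf "crlf layer: %d lines, raw layer: %d lines\n",
    scalar @text, scalar @raw;
```

    Both reads should return two lines. Only the :crlf read has its line endings converted, and neither stops at chr(26).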

Re: Weird Character in File Makes Perl Think it's EOF
by periapt (Hermit) on Oct 17, 2008 at 20:32 UTC

    You could certainly read the file in using binmode but, as wol noted, you do lose end-of-line handling. Depending on what is happening with your file before the parsing stage, you may want to try preprocessing it before the parse step.

    Assuming that your file should only have word characters in it (as defined by \w = [a-zA-Z0-9_]), you could try this one-liner:

    perl -i.orig -p -e "s/\W+/?/g;" <yourfile>

    This will rename the original file to <yourfile>.orig and replace every run of non-word characters with a single question mark. Note that \W also matches spaces and line endings, so those are replaced too. I am assuming here that you want to retain the relative location of the offending byte. If you don't, simply write s/\W+//g instead of s/\W+/?/g.

    If you wanted to write the output to STDOUT, say, before passing the data to another process, you can omit the -i.orig flag.

    Of course, you could do it with sed or gawk but this is PerlMonks ;o).


    PJ
    use strict; use warnings; use diagnostics;
      My pre-processing suggestion would be to use tr:

      tr -d "\032" < infile > outfile

      ...or...

      tr "\032" " " < infile > outfile

      If you use Gawk, you have to set its BINMODE.
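      For comparison, here is a rough Perl equivalent of the two tr commands above. It is a sketch only: the helper name scrub_sub_bytes and the sample data are hypothetical, and both files are opened with the :raw layer so that nothing is translated on the way through.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: copy $src to $dst as raw bytes. With $use_space false,
# every \x1A byte is deleted (like tr -d); with it true, every \x1A byte is
# replaced by a space. Returns the number of bytes affected.
sub scrub_sub_bytes {
    my ($src, $dst, $use_space) = @_;
    open(my $in,  '<:raw', $src) or die "Cannot open $src: $!";
    open(my $out, '>:raw', $dst) or die "Cannot open $dst: $!";
    my $count = 0;
    while (my $chunk = <$in>) {
        $count += $use_space ? ($chunk =~ tr/\x1A/ /)
                             : ($chunk =~ tr/\x1A//d);
        print {$out} $chunk;
    }
    close($in);
    close($out) or die "Cannot close $dst: $!";
    return $count;
}

# Sample input with two \x1A bytes, plus an empty output file.
my ($src_fh, $src) = tempfile(UNLINK => 1);
binmode($src_fh);
print {$src_fh} "abc\n\x1Adef\x1A\n";
close($src_fh) or die "Cannot close $src: $!";
my ($dst_fh, $dst) = tempfile(UNLINK => 1);
close($dst_fh);

my $deleted = scrub_sub_bytes($src, $dst, 0);
print "deleted $deleted byte(s)\n";
```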

      Using ActivePerl for Windows, I've never had to use binmode to handle nasty ASCII control characters like NUL (0x00) and SUB (0x1A). It seems to read and write them in text mode just fine.

      D:\>perl -e "print qq{\x00\x1A\nfoo\nbar\x1A\x00\n}"

      foo
      bar
      D:\>perl -e "print qq{\x00\x1A\nfoo\nbar\x1A\x00\n}" | perl -ne "print if m/foo/"
      foo
      D:\>
      Jim
