PerlMonks  

Very weird things when printing (may be an encoding issue?)

by dottornomade (Initiate)
on May 30, 2013 at 00:29 UTC (#1035937=perlquestion)
dottornomade has asked for the wisdom of the Perl Monks concerning the following question:

Hi there,

This is the first time something like this has happened to me.

I have this flat file and simply... the lines get all messed up when I print them on screen, and I can't even parse them properly! This is very frustrating. Sadly, I cannot attach the file here. When I open and view it with gedit, everything looks fine. Here is a snippet:

missense,0.40851449275362317,1.0,-100
2.853,2.853,5.706,2.853,2.853,8.559,8.559,...
missense,0.40851449275362317,1.0,0
2.827,2.827,5.655,2.827,2.827,5.655,8.482,...
frameshift,0.056074766355140186,0.7697841726618705,-64
5.290,1.763,0.000,8.817,1.763,3.527,1.763,...
missense,0.44542772861356933,1.0,0

Basically it alternates two kinds of lines: a title line and a data line.

Now, say I want to print the first line:

open FILE, "<", "weird" or die $!;
my @data = <FILE>;
for (my $line=0; $line<1; $line++) {
    print $data[$line];
}

And it then prints ALL the odd lines! One after another, ignoring all the lines starting with >. I need to parse this file. When I split the lines with the split command, it considers the last element of an even line to be the whole following odd one, and it ignores the actual true last element. E.g., if I use:

my @tmp = split(',', $data[0]);
print $tmp[$#tmp]."\n".$tmp[$#tmp-1];
The actual output is:
missense,0.40851449275362317,1.0, 2.853,2.853,5.706,2.853,2.853,8.559,8.559,... 0

Note that there is a \n in the middle of the first line, and a -100 value missing.

What's happening? This file was made by a bot interacting with a server...I guess it may be something related to the file encoding, but I have no idea about how to fix it.

Re: Very weird things when printing (may be an encoding issue?)
by JockoHelios (Scribe) on May 30, 2013 at 00:38 UTC
    If you are just trying to print the entire file, you don't need the line counters. Perl will keep track of where you are in the file and return the next line until the end of the file. I think the code below will do that, once you have FILE opened.

    From your post, I'm not sure what else you're trying to accomplish. Are you looking to parse the fields in the data line?

    my @data = <FILE>;
    for my $OneLine (@data) {
        print $OneLine;
    }
    Dyslexics Untie !!!
Re: Very weird things when printing (may be an encoding issue?)
by frozenwithjoy (Curate) on May 30, 2013 at 01:02 UTC
    One after another, ignoring all the lines starting with >.

    I don't see any > characters in your sample text. Are you sure it was copied/formatted properly in your post?

    Also, it doesn't look like you are actually splitting on commas. What do you get when you include this:

    say "@tmp";

      My bad! Sorry, I was very tired when I wrote the message; there is no > in the lines. I meant: it just prints one line and ignores the following one.

Re: Very weird things when printing (may be an encoding issue?)
by Anonymous Monk on May 30, 2013 at 02:09 UTC
    Use

    perl -MData::Dump -MFile::Slurp -e " dd scalar read_file shift, { qw/ binmode :raw / }; " AnyKindOfInputFile > ThatFilesBytesAsPerlAsciiCode.pl

    to generate sample data to post to PerlMonks.
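If Data::Dump and File::Slurp are not installed, the same kind of inspection can be sketched with core Perl only. The helper names show_line_endings and slurp_raw are hypothetical, and the sample string is an assumption modelled on the snippet in the question, not the real file:

```perl
use strict;
use warnings;

# Hypothetical helper: read a whole file as raw bytes (no newline translation).
sub slurp_raw {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    local $/;                      # undef record separator = slurp mode
    my $bytes = <$fh>;
    close $fh;
    return $bytes;
}

# Hypothetical helper: make CR and LF visible instead of letting the
# terminal act on them.
sub show_line_endings {
    my ($text) = @_;
    (my $visible = $text) =~ s/\r/<CR>/g;
    $visible =~ s/\n/<LF>\n/g;
    return $visible;
}

# Assumed sample: a title line ending in CR followed by a data line ending in LF.
my $sample = "missense,0.40851449275362317,1.0,-100\r2.853,2.853,5.706\n";
print show_line_endings($sample);
```

For a real file, something like print show_line_endings(slurp_raw('weird')) would show where each CR and LF sits.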
Re: Very weird things when printing (may be an encoding issue?)
by bioinformatics (Friar) on May 30, 2013 at 06:00 UTC
    So, you have output from a variant caller or database. Have you had to parse data from the bot previously, or is this a one-off file? Some ideas:

    1) Check the API documentation to see if it gives any hints.
    2) Check if the output file was made on a Mac (classic Mac OS ends lines with CR). You can try setting $/ = "\015"; at the beginning of the script to test (it changes the input record separator from \n to \r).

    Bioinformatics
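To see which separator is worth trying, here is a minimal sketch that counts each kind of line ending and then reads the same text with a chosen $/. The helper names and the sample string are assumptions for illustration; an in-memory filehandle stands in for the real file:

```perl
use strict;
use warnings;

# Hypothetical helper: count each kind of line ending in a string.
sub count_line_endings {
    my ($text) = @_;
    my %count;
    $count{CRLF} = () = $text =~ /\r\n/g;        # CR immediately followed by LF
    $count{CR}   = () = $text =~ /\r(?!\n)/g;    # lone CR
    $count{LF}   = () = $text =~ /(?<!\r)\n/g;   # lone LF
    return \%count;
}

# Hypothetical helper: read records from a string using the given
# input record separator.
sub read_records {
    my ($text, $separator) = @_;
    open my $fh, '<', \$text or die "Cannot open in-memory handle: $!";
    local $/ = $separator;
    my @records = <$fh>;
    chomp @records;                # chomp strips the current value of $/
    close $fh;
    return @records;
}

# Assumed sample with mixed endings.
my $sample = "title1\rdata1\ntitle2\rdata2\n";
my $counts = count_line_endings($sample);
print "$_: $counts->{$_}\n" for sort keys %$counts;
print scalar(read_records($sample, "\n")), " records with LF as separator\n";
```

If the CR count is nonzero while the file is being read with the default $/ of "\n", the stray CRs stay inside the records, which would explain output that appears to overwrite itself on a terminal.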

      Thank you guys for the replies!

      Now I don't feel alone anymore.

      What I have to do is parse the file, retrieve the values and do some math.

      I opened the file with a hex editor and found that odd lines end with "0A" (LineFeed), and even lines end with "0D" (CarriageReturn). And so here I am again, seeking Perl wisdom at the monastery.

        I opened the file with a hex editor and found that odd lines end with "0A" (LineFeed), and even lines end with "0D" (CarriageReturn). And so here I am again, seeking Perl wisdom at the monastery.

         perl -pi  -e " s/[\r\n]+/\n/g; " file
        dos2unix/Removing ^M char AKA dos2unix
        PerlIO::eol - PerlIO layer for normalizing line endings
        Text::FixEOL - Canonicalizes text to a specified EOL/EOF convention, repairing any 'mixed' usages
        File::LocalizeNewlines - Localize the newlines for one or more files

        In that case, use the CR to separate the records and the LF to split into lines:

        #!perl
        use strict;
        open IN,'<','weird.txt' or die "$!";
        {
          local $/ = "\x0D";
          while (<IN>){
            my @lines = split /\x0A/;
            print $lines[0]."\n";
            print $lines[1]."\n\n";
          }
        }
        poj
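If it is unclear which of the two endings closes a record, a variant that does not depend on the order is to split on any line ending and pair the lines up afterwards. This is only a sketch: parse_pairs is a hypothetical name, the sample is shortened from the snippet in the question with assumed line endings, and the sum stands in for whatever math is actually needed:

```perl
use strict;
use warnings;

# Hypothetical helper: split text on CRLF, CR or LF, then pair each
# title line with the data line that follows it.
sub parse_pairs {
    my ($text) = @_;
    my @lines = grep { length } split /\r\n|\r|\n/, $text;
    my @records;
    while (@lines >= 2) {
        my $title = shift @lines;
        my $data  = shift @lines;
        my ($type, @scores) = split /,/, $title;
        push @records, {
            type   => $type,
            scores => \@scores,
            values => [ split /,/, $data ],
        };
    }
    return @records;
}

# Assumed sample: titles from the question, data lines cut to three values.
my $sample = "missense,0.40851449275362317,1.0,-100\r2.853,2.853,5.706\n"
           . "frameshift,0.056074766355140186,0.7697841726618705,-64\r5.290,1.763,0.000\n";

for my $rec (parse_pairs($sample)) {
    my $sum = 0;
    $sum += $_ for @{ $rec->{values} };
    printf "%s last=%s sum=%.3f\n", $rec->{type}, $rec->{scores}[-1], $sum;
}
```

Splitting on /\r\n|\r|\n/ treats every ending the same way, so the last field of a title line (for example -100) no longer carries the following data line with it.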
Re: Very weird things when printing (may be an encoding issue?)
by t_rex_joe (Acolyte) on May 30, 2013 at 17:29 UTC
    I had a similar problem with Cisco Discovery Protocol. When the neighbors (switches/routers) showed up, they included a "NULL" (00 hex) in the description, which threw a new line into the output and screwed up my logs. Here's how I fixed it: create a "sub clean" which converts the string to hex, flushes out the hidden characters, converts back to ASCII, then returns the sanitized string. This has since solved all my output problems...

    $item = caschex(1, $item); #A->H
    if($debug == 1) { print "HEX: \"$item\"\n"; }
    $item = caschex(2, $item); #H->A
    if($debug == 1) { print "ASCII: \"$item\"\n"; }
    $item =~ s/^\s+//; #remove Leading whitespace
    $item =~ s/\s+$//; #remove trailing whitespace
    $item =~ s/\s+/ /g; #replace multiple spaces with one
    chomp($item); #remove newline character

    ########################################################
    # Sub caschex
    #
    # USAGE: Removes hidden strings in variables.. Addition to Clean
    #
    # v1.0.0 -> 2006-04-20
    #   Born
    #
    ######
    # my ($item) = cipdec(1, $ip); #1 = A->H, 2 = H->A
    ######
    #
    sub caschex {
        # 1 = ASCII TO HEX
        # 2 = HEX TO ASCII
        ##########
        my $debug = 0;
        ##########
        if($debug == 1) { print "---------------------------------- ENTERED SUB: \"caschex\"\n"; }
        my $opt = undef;
        my $item = undef;
        my $ret = undef;
        my $val = undef;
        $opt = shift(@_);
        $item = shift(@_);
        if($debug == 1) {
            print "\n\n";
            print "OPT: \"$opt\"\n";
            print "ITEM: \"$item\"\n";
        }
        ############################# OPT 1
        if($opt == 1) {
            if($debug == 1) { print "CONVERT ASCII TO HEX\n"; }
            $key = undef;
            $val = undef;
            foreach $key (split//,$item) {
                if($debug == 1) { print "KEY: \"$key\"\n"; }
                ($key) = sprintf("%02lx", ord $key);
                if($debug == 1) { print "HEX KEY: \"$key\"\n"; }
                if(($key eq "00") || ($key eq "1b")) {
                    if($debug == 1) { print "FOUND NULL IN ASCII VARIABLE... REPLACE WITH SPACE\n"; }
                    $key = " "; #NOTE: A SPACE IN HEX IS "20"
                    ($key) = sprintf("%02lx", ord $key);
                    if($debug == 1) { print "HEX KEY: \"$key\"\n"; }
                }
                $val .= $key;
                if($debug == 1) { print "VAL: \"$val\"\n"; }
            }
            if($debug == 1) { print "COMPLETE VAL: \"$val\"\n"; }
        }
        ############################# EO OPT 1
        ############################# OPT 2
        if($opt == 2) {
            if($debug == 1) { print "CONVERT HEX TO ASCII\n"; }
            $key = undef;
            $val = undef;
            foreach $key ($item =~ /[a-fA-F0-9]{2}/g) {
                if($debug == 1) { print "HEX KEY: \"$key\"\n"; }
                if(($key eq "00") || ($key eq "1b")) {
                    if($debug == 1) { print "FOUND NULL IN HEX.. REPLACE WITH SPACE\n"; }
                    $key = 20;
                    if($debug == 1) { print "HEX KEY: \"$key\"\n"; }
                }
                ($key) = chr(hex $key);
                if($debug == 1) { print "ASC KEY: \"$key\"\n"; }
                $val .= $key;
                if($debug == 1) { print "VAL: \"$val\"\n"; }
            }
            if($debug == 1) { print "COMPLETE VAL: \"$val\"\n"; }
        }
        ############################# EO OPT 2
        $ret = $val;
        if($debug == 1) { print "RET: \"$ret\"\n"; }
        if($debug == 1) { print "---------------------------------- LEAVING SUB: \"caschex\"\n"; }
        $debug = 0;
        return($ret);
    }
    #
    #
    ############################################## EO SUB CASCHEX
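As far as I can tell, the hex round trip above amounts to replacing NUL (00) and ESC (1b) bytes with spaces and then tidying whitespace. A more compact sketch of that idea, using tr on the string directly, might look like this; clean_string is a hypothetical name and the sample input is made up:

```perl
use strict;
use warnings;

# Hypothetical compact alternative: no conversion to hex and back.
sub clean_string {
    my ($item) = @_;
    $item =~ tr/\x00\x1b/ /;   # NUL and ESC both become a space
    $item =~ s/^\s+//;         # remove leading whitespace
    $item =~ s/\s+$//;         # remove trailing whitespace (including newlines)
    $item =~ s/\s+/ /g;        # collapse runs of whitespace to one space
    return $item;
}

# Made-up sample with an embedded NUL and a trailing ESC.
print clean_string("  sw1\x00core\x1b  \n"), "\n";
```

Since \s also matches \r and \n, the same cleanup collapses stray carriage returns inside a string, though for the file in the question it is probably better to split on them than to erase them.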
