PerlMonks  

end of line help

by p789123 (Initiate)
on Nov 17, 2010 at 19:58 UTC (#872045=perlquestion)
p789123 has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

I am not a programmer by any stretch of the imagination, but I was recently volunteered to anonymize several thousand binary files, which involves replacing a 1-, 2-, or 3-digit number with a different 1-, 2-, or 3-digit number. So I thought now would be a good time to learn Perl.

After some experimentation I have a script that does what I want, but the output files are recognized as corrupted by the software I use to compile them. Each is about 4 KB larger than the original, and when I look at them in a hex editor I see that the original had 0a as the line separator and the output has 2 characters; I think it was 0a 0d. I am working in Windows XP with Strawberry Perl; not sure what os original files were on.

After more reading I thought I could use -l012 on the command line, but that had the effect of adding additional blank lines, effectively double-spacing the file.
Thanks. Here's my code:

#!/usr/local/bin/perl -w
#use strict;

#####Point these paths to original cel files and an empty new folder
$oripath = "r://cel file anonymization//original cel files//";
$anonpath = "r://cel file anonymization//anonymized cel files//";

#####read in the link file; store it in array and close the file
open(linkfile,"r://B27 cel file anonymization//B27anonymizationlinkfile.txt") or die ("CAnt open link file!\n");
@link=<linkfile>;
shift(@link);
close(linkfile);

#####loop through the cel files, replacing the names and creating new anonymized cels
foreach $sample (@link) {
    chomp($sample);
    ($infile,$labno,$out,$anonno)=split(/\t/,$sample);
    print("Anonymizing $infile\n");
    open (INFILE, $oripath.$infile) or die "can't open file $oripath $infile $!" ;
    binmode (INFILE);
    open OUTPUTFILE, ">", $anonpath.$out or die "cant open outputfile";
    while(<INFILE>) {
        $_=~s?B27-\d\d\d|B27-\d\d|B27-\d?B27-$anonno?g;
        print OUTPUTFILE $_;
    }
    close (INFILE);
    close (OUTPUTFILE);
}

Re: end of line help
by roboticus (Canon) on Nov 17, 2010 at 20:25 UTC

    p789123:

    Try adding binmode(OUTPUTFILE); just after you open it (just like you do for the other file handle). Windows machines commonly use 0x0d 0x0a (\r\n) as the line ending in text files, so on Windows perl translates each \n it writes into \r\n unless you tell it not to.

    ...roboticus
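
To make the size difference concrete, here is a minimal sketch (not code from the thread). It assumes the :crlf PerlIO layer is a fair stand-in for what a Windows perl applies to text-mode handles by default, so the effect can be reproduced on any OS. The file names and sample data are made up for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir  = tempdir(CLEANUP => 1);
my $data = "line one\nline two\n";    # 18 bytes, two LF line endings

# Text-mode handle with CRLF translation, as Windows perl uses by default.
open my $text, '>:crlf', "$dir/text.out" or die "text.out: $!";
print {$text} $data;
close $text;

# Same data, but binmode switches the translation off.
open my $raw, '>', "$dir/raw.out" or die "raw.out: $!";
binmode $raw;
print {$raw} $data;
close $raw;

my $text_size = -s "$dir/text.out";
my $raw_size  = -s "$dir/raw.out";
print "text mode: $text_size bytes, binmode: $raw_size bytes\n";
```

The text-mode file grows by one byte per newline, which would be consistent with output files coming out a few KB larger than the originals.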

Re: end of line help
by ww (Bishop) on Nov 17, 2010 at 21:54 UTC
    ... and uncomment that #use strict;.

    Strict exists to help you: it highlights typos, for example, that would otherwise have you acting on a different variable than the one you intended.

    And just to make your life (and learning) easier, enable warnings as well.
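
As a small illustration of the kind of typo strict catches, the sketch below compiles a snippet with a misspelled variable inside a string eval, so that the compile-time error can be captured and printed instead of aborting the script. The variable names are only examples modelled on the original post.

```perl
use strict;
use warnings;

# A snippet with a typo: $annono instead of $anonno.
my $snippet = <<'END';
use strict;
my $anonno = 42;
my $label  = "B27-$annono";
END

# Under strict the misspelled variable is a compile-time error,
# which the string eval traps in $@.
my $ok    = eval $snippet . "; 1";
my $error = $@;

print "strict refused to compile the snippet:\n$error" unless $ok;
```

Without strict, the misspelled variable would silently be treated as a new, empty global, and the label would come out as just "B27-".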

Re: end of line help
by aquarium (Curate) on Nov 18, 2010 at 00:44 UTC
    That's a pretty good effort for a Perl beginner.
    I know you're further down the track, as you indicated; however, I do question what is read into the array from the link file and what is really written. That's because you have not set the infile and outfile line endings, which is dodgy anyway if you're processing a true binary file, which has no line endings. Along the same lines, binary data is usually not suitable for splitting on a tab character, as that implies a text encoding, and a binary file has none. In the case of a "true/real" binary file, you'd binmode both input and output, set the input and output line endings ($/ and $\) to nothing (just to be sure they don't get appended somewhere automatically), sysread the data into a buffer and unpack it, and then repack it for the write. You also have to make sure that each bit of text that is replaced is exactly the same length in bytes as on input. A regex with a scalar variable substitution is liable to break string-length consistency unless done carefully.
    But like I said, you seem to be further down the track, and only you know the true nature of the binary file involved. Enjoy Perl.
    the hardest line to type correctly is: stty erase ^H
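
One way to keep the byte length constant is sketched below. It zero-pads the replacement number to the width of the number it replaces. This is an assumption made for illustration only: nothing in the thread says the file format requires fixed-length IDs or tolerates zero-padded ones, and anonymize_same_length is a hypothetical helper name.

```perl
use strict;
use warnings;

# Replace the digits after "B27-" with $anonno, zero-padded to the width
# of the original number so the content keeps the same length in bytes.
sub anonymize_same_length {
    my ($content, $anonno) = @_;

    $content =~ s{B27-(\d{1,3})}{
        my $width = length $1;
        die "replacement $anonno does not fit in $width digit(s)\n"
            if length($anonno) > $width;
        sprintf 'B27-%0*d', $width, $anonno;
    }gex;

    return $content;
}

my $before = "id=B27-123;";
my $after  = anonymize_same_length($before, 7);
printf "%s -> %s (%d -> %d bytes)\n",
    $before, $after, length $before, length $after;
```

The sketch dies rather than writing a replacement that is longer than the original, since that would shift every byte after it.
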
Re: end of line help
by oko1 (Deacon) on Nov 18, 2010 at 02:18 UTC

    > {...} not sure what os original files were on.

    In that case, I would strongly suggest not treating these files as text - which is what you're doing with 'while(<INFILE>)' (which reads from the filehandle one "line" at a time - the definition of 'line' being determined by the value of $/.)
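
A quick sketch of how $/ changes what one read returns. It uses an in-memory filehandle and made-up sample bytes so that it runs without any input file.

```perl
use strict;
use warnings;

# Sample bytes with mixed line endings, standing in for file content.
my $bytes = "AAA\nBBB\r\nCCC";

# Default $/ is "\n": each read returns everything up to the next LF.
open my $by_line, '<', \$bytes or die "in-memory open: $!";
binmode $by_line;
my @records = <$by_line>;
close $by_line;

# With $/ undefined, a single read returns the whole content.
open my $all, '<', \$bytes or die "in-memory open: $!";
binmode $all;
my $whole = do { local $/; <$all> };
close $all;

printf "%d records line by line, %d bytes in one slurp\n",
    scalar @records, length $whole;
```

Note that the "\r" stays attached to the second record, because only "\n" counts as the separator here.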

    You've mentioned that these files are binary. To me, that suggests that they have some kind of structure. If so, you'd probably be better off either parsing them and modifying only the part that you want (assuming you know the data structure) or slurping the entire file as a string - assuming that they're a relatively small proportion of your memory size - and modifying the string. Something like this, maybe:

    #!/usr/bin/perl -w
    use strict;

    open my $Link, "linkfile.txt" or die "linkfile.txt: $!\n";
    while (<$Link>){
        next if $. == 1;    # Skip the first line
        chomp;
        my ($infile, $labno, $out, $anonno) = split /\t/;
        print "Anonymizing $infile\n";

        open my $In, $infile or die "$infile: $!\n";
        binmode $In;
        my $content = do { local $/; <$In> };
        close $In;

        $content =~ s?B27-\d{1,3}?B27-$anonno?gsm;

        open my $Out, ">", $out or die "$out: $!\n";
        binmode $Out;
        print $Out $content;
        close $Out;
    }

    --
    "Language shapes the way we think, and determines what we can think about."
    -- B. L. Whorf
      Definitely agree with the idea of slurping the whole file into a string in binary mode, assuming it fits - or maybe there is a variant of Tie::File for binary files that might help here (I did a quick search on CPAN and nothing jumped out)?

      In any case, if you are transferring files between systems, possibly running different OSes, using something like FTP, make sure you use binary mode for the transfer.

Re: end of line help
by p789123 (Initiate) on Nov 18, 2010 at 12:29 UTC
    Thanks to all for your help! Adding binmode(OUTPUTFILE) solved my immediate problem.
    I am not sure if the file meets your definitions of binary, as it does have line endings. In a text editor the first few lines are readable, followed by a few thousand lines of gibberish. And just to clarify, the 'binary' files are not tab-delimited; the first input file (the link file) is a spreadsheet with the original and replacement values.

    I commented out strict because it gave me too many errors, and the easiest solution I could find was to turn it off. I know the logic there is terrible, and as I have time I will try to learn the correct syntax.
    Thanks again!!!
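
For what it's worth, most of the errors strict reports on a script like the original go away once each variable is declared with my. Below is a minimal sketch of the per-file step written that way, with binmode on both handles. The name anonymize_file and its parameters are hypothetical; the loop over the link file is left out.

```perl
use strict;
use warnings;

# Copy $in_path to $out_path, replacing every "B27-" followed by
# one to three digits with "B27-$anonno". Both handles are binary.
sub anonymize_file {
    my ($in_path, $out_path, $anonno) = @_;

    open my $in,  '<', $in_path  or die "can't open $in_path: $!";
    open my $out, '>', $out_path or die "can't open $out_path: $!";
    binmode $in;
    binmode $out;

    while (my $line = <$in>) {
        $line =~ s/B27-\d{1,3}/B27-$anonno/g;
        print {$out} $line;
    }

    close $in;
    close $out or die "can't close $out_path: $!";
    return;
}
```

It would be called once per row of the link file, for example anonymize_file($oripath . $infile, $anonpath . $out, $anonno), with those three variables also declared with my.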

Node Type: perlquestion [id://872045]
Approved by ww