PerlMonks  

header footer

by gupr1980 (Acolyte)
on Mar 04, 2014 at 21:39 UTC ( #1076973=perlquestion )

gupr1980 has asked for the wisdom of the Perl Monks concerning the following question:

I have a huge file (> 3GB). It has multiple records, each with a header and a footer. The header and footer are fixed length: 50 and 30 characters long respectively. I need to remove those headers and footers from the file. I am a shell script guy and new to Perl. I read through some tutorials and played around a bit. I tried samples like the one below to find a string and delete it. Headers always start with HDR and footers start with FTR.

    use strict;
    use warnings;
    $^I = '.bak'; # create a backup copy
    while (<>) {
        s/HDR//g; # do the replacement
        print; # print to the modified file
    }

This deletes the entire line, not just the 50 characters. Also, I am testing with a small file (3KB), and the while loop stores the file content in a variable. I am guessing that would be an issue for a file 2GB in size. Any suggestions on where I can go from here? Please help, and hello to the forum. This is going to be my home for the next few months.

Replies are listed 'Best First'.
Re: header footer
by kcott (Bishop) on Mar 04, 2014 at 22:13 UTC

    G'day gupr1980,

    Welcome to the monastery.

    Your substitution regex has little bearing on the task you describe. You might want to get up to speed on Perl's regular expressions by reading "perlrequick - Perl regular expressions quick start" and "perlretut - Perl regular expressions tutorial".

    Having said that, I see no reason to use regular expressions here: substr is quite capable of doing this (and I'd expect it to be a lot faster).

    Here's an example script, pm_1076973.pl, with much shorter header and footer lengths for demo purposes:

    #!/usr/bin/env perl
    use strict;
    use warnings;
    $^I = '.bak';
    my $head_len = 5;
    my $foot_len = 3;
    while (<>) {
        chomp;
        print substr($_, $head_len, length($_) - $head_len - $foot_len), "\n";
    }

    Here are the starting files (before that script is run):

    $ ls -l pm_1076973.*
    -rwxr-xr-x 1 ken staff 210 5 Mar 08:50 pm_1076973.pl
    -rw-r--r-- 1 ken staff 87 5 Mar 08:46 pm_1076973.txt

    $ cat pm_1076973.txt
    12345... content ...123
    12345... more content ...123
    12345... even more content ...123

    Now run the script:

    $ pm_1076973.pl pm_1076973.txt

    Now the files look like this:

    $ ls -l pm_1076973.*
    -rwxr-xr-x 1 ken staff 202 5 Mar 08:56 pm_1076973.pl
    -rw-r--r-- 1 ken staff 63 5 Mar 08:56 pm_1076973.txt
    -rw-r--r-- 1 ken staff 87 5 Mar 08:46 pm_1076973.txt.bak

    $ cat pm_1076973.txt
    ... content ...
    ... more content ...
    ... even more content ...

    $ cat pm_1076973.txt.bak
    12345... content ...123
    12345... more content ...123
    12345... even more content ...123

    -- Ken

      Does your code assume there is just one header and footer in the file? If not, I am not seeing how it will go from one header/footer segment to another. And thanks for the reading links. They definitely help.
        "Does your code assume there is just one header and footer in the file?"

        No, it does not. Furthermore, given I put in a fair amount of effort to show the exact input and output, I'm surprised you're asking.

        "If not, I am not seeing how it will go from one header/footer segment to another."

        That sounds like you didn't even try it. Did you just have a quick look and decide it wouldn't work?

        Changing the lengths from my demo 5 and 3 to your real application requirements of 50 and 30, your sample records (posted below):

        HDR.S287878877.DDDDD.DDDDDDXXXXXXXXXXXXXXXXXXXXXXX1STR HYTRES NAME PLACE DEST GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG GGGGGGGGGGGGGGGGGGGG1111111111111111111112222222222222222222222222222333333333333333333333333333444444444444444444444444444444444455 5555555555555555555555555555566666666666666777777777777FTRDDDDDDDDDDFFFFFFFFFFFFFFFFF
        HDR.S287878877.DDDDD.DDDDDDXXXXXXXXXXXXXXXXXXXXXXX1STR HYTRES NAME PLACE DEST GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG GGGGGGGGGGGGGGGGGGGG1111111111111111111112222222222222222222222222222333333333333333333333333333444444444444444444444444444444444455 5555555555555555555555555555566666666666666777777777777FTRDDDDDDDDDDFFFFFFFFFFFFFFFFF

        become

        1STR HYTRES NAME PLACE DEST GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG GGGGGGGGGGGGGGGGGGGG1111111111111111111112222222222222222222222222222333333333333333333333333333444444444444444444444444444444444455 5555555555555555555555555555566666666666666777777777777
        1STR HYTRES NAME PLACE DEST GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG GGGGGGGGGGGGGGGGGGGG1111111111111111111112222222222222222222222222222333333333333333333333333333444444444444444444444444444444444455 5555555555555555555555555555566666666666666777777777777

        -- Ken

Re: header footer
by Eily (Monsignor) on Mar 04, 2014 at 21:49 UTC

    You probably want to use substr instead of a regex, because it takes a length parameter for counting from the start, or a negative offset for counting from the end. And to cut only the first 50 chars of the first line and the last 30 of the last, you can use the variable $. (the current line number, which is 1 once the first line has been read) and eof (which will be true on the last line). So:

    RemoveHeader() if $. == 1;
    RemoveFooter() if eof;

    If you intend to use your script on several files at once, have a look at eof on how to reset $. at the start of each file.
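
A minimal sketch of the approach described above, assuming a single header at the start and a single footer at the end of each file. The thread later establishes that the real data has one pair per record, so this only illustrates the `$.` and `eof` mechanics. The sub name `trim_first_and_last` and its parameters are made up for illustration. `close ARGV if eof` is the idiom the eof documentation gives for resetting `$.` between files.

```perl
use strict;
use warnings;

# Hypothetical helper: trim $head_len characters from the first line and
# $foot_len characters from the last line of every file in @files.
# Returns the trimmed text of all files as one string.
sub trim_first_and_last {
    my ($head_len, $foot_len, @files) = @_;
    local @ARGV = @files;          # let <> iterate over the given files
    my $out = '';
    while (my $line = <>) {
        chomp $line;
        # $. is 1 once the first line of the current file has been read
        substr($line, 0, $head_len, '') if $. == 1;
        if (eof) {                 # true on the last line of the current file
            substr($line, -$foot_len, $foot_len, '')
                if length($line) >= $foot_len;
        }
        $out .= "$line\n";
    }
    continue {
        close ARGV if eof;         # resets $. before the next file starts
    }
    return $out;
}
```

Called as `trim_first_and_last(50, 30, @ARGV)`, it would return the trimmed contents of every file named on the command line.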

      Found another example that doesn't store the entire file in a variable but reads it line by line, like this:

      open ( my $input_fh, "<", $input_file );
      open ( my $output_fh, ">", $output_file );
      while ( my $line = <$input_fh> ) {
          ##########
          # i have to use regex here to first see if this line starts with a HDR - correct?
          # then if it does i would do a substring to delete the first 50 and write the rest to the output file
          # else write the entire line to output
          # similarly check if it starts with FTR and substring again
          ##########
      }
      close ( $input_fh );
      close ( $output_fh );

      I didn't quite get your suggestion on using the line number and eof. The file will have numerous records that have headers and footers. Am I on the right track with the above approach? Storing in a variable vs reading line by line: is one way better than the other?
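
On the "variable vs. line by line" question, a rough sketch may help. It assumes the per-line change can be expressed as a code reference; `filter_file` and `$transform` are hypothetical names, not anything from the thread. The point it illustrates is that a `while` loop reads one line per iteration, whereas `foreach my $line (<$fh>)` evaluates the readline in list context and loads every line into memory before the loop starts, which matters for a multi-gigabyte file.

```perl
use strict;
use warnings;

# Hypothetical helper: stream $in_path to $out_path one line at a time,
# passing each line through the code reference $transform.
sub filter_file {
    my ($in_path, $out_path, $transform) = @_;
    open my $in_fh,  '<', $in_path  or die "Cannot read $in_path: $!";
    open my $out_fh, '>', $out_path or die "Cannot write $out_path: $!";

    # Only the current line is held in memory here.
    while (my $line = <$in_fh>) {
        print {$out_fh} $transform->($line);
    }

    close $in_fh;
    close $out_fh or die "Cannot close $out_path: $!";
    return;
}
```

Any per-line header/footer removal could then be passed in as the third argument, for example `filter_file($input_file, $output_file, sub { my ($l) = @_; $l })` for a plain copy.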

        Yeah, I just missed the multiple records in the same file. Mea culpa.

        Then you can do something like:

        my %length = (HDR => 5, FTR => 10);
        while(<DATA>)
        {
            while(/(HDR|FTR)/) # find either HDR or FTR
            {
                substr($_, $-[1], $length{$1}) = ''; # $-[1] is the position of the first capture group (parentheses)
            }
            print;
        }
        __DATA__
        HDR--Hello this is a test FTR-------Should there be text here? HDR--Two records on the same line FTR-------
        HDR--and here an incomplete footer FTR--

Re: header footer
by oiskuu (Hermit) on Mar 04, 2014 at 23:03 UTC

    I'd suggest dividing the problem into parts. A: reading of a single record; B: writing the modified record; ...

    Most of the difficulty is in the reading part, and we cannot really offer you much advice without learning all the details about the file format. Is the record of arbitrary length? Can a single record span megabytes? Is the record size encoded in the header?

    Update: May a record contain "HDR" or "FTR" in its body (as a substring)?

      No. It will only be there in the header and footer. You won't find a HDR or FTR in the body.
        Hmm.. I just realized some of the responses are not shown in full, just the header :) Sorry, I am reading through those now. Thanks.
      The record is arbitrary in length, yes. But like I said, the header and footer lengths are always 50 and 30. An individual record is only about 10kb, but there are way too many records. The record size is not encoded in the header.
        So am I missing something in thinking that the following would work? Read line by line; check if the pattern HDR exists, and if so cut the header and write the rest of the line to outfile; write the following lines to outfile unchanged; keep going until the pattern FTR is found, then cut from FTR to the end of the line and write what is left to outfile. Would this not work?
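
The steps described above could be sketched roughly like this, assuming each header and each footer sits entirely within one line and that HDR and FTR never occur in the body (as stated earlier in the thread). `strip_line` is a hypothetical helper name; the lengths 50 and 30 are the ones given in the question.

```perl
use strict;
use warnings;

# Hypothetical helper: remove every fixed-length header and footer
# found in a single line and return what is left.
sub strip_line {
    my ($line) = @_;
    my %length = (HDR => 50, FTR => 30);   # lengths stated in the question
    for my $tag (sort keys %length) {
        my $pos;
        # index() returns -1 when the tag is not present
        while (($pos = index($line, $tag)) >= 0) {
            substr($line, $pos, $length{$tag}, '');   # cut from the tag onwards
        }
    }
    return $line;
}
```

Inside a `while (my $line = <$input_fh>)` loop this would be used as `print {$output_fh} strip_line($line);`, so lines without a header or footer pass through unchanged.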
Re: header footer
by Laurent_R (Canon) on Mar 04, 2014 at 22:33 UTC
    Hmm, I have the feeling that there are some misunderstandings here. It seems that there are several headers and several footers, not just one of each in the file. Possibly even one header and one footer for each record. It is also not entirely clear whether the headers and footers are on the same line as the data. Can you please give a sample of your file so that we can better understand its structure?

      Oops, indeed.

      Well then, maybe setting the input record separator ($/) to "HDR" would work.

      Or s/HDR.{47}|FTR.{27}//g;, but it doesn't appeal much to me.
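
A sketch of the `$/` idea, under the assumptions that every record starts with a 50-character header beginning with HDR, contains one 30-character footer beginning with FTR, and that neither string occurs in the body. `strip_by_separator` is a hypothetical name. Since only one record (about 10kb according to the poster) is held at a time, memory use should stay small even for a file of several GB.

```perl
use strict;
use warnings;

# Hypothetical helper: copy filehandle $in to filehandle $out, dropping
# each 50-character header and each 30-character footer.
sub strip_by_separator {
    my ($in, $out) = @_;
    local $/ = 'HDR';                 # every read now ends at the next "HDR"
    my $first = 1;
    while (my $chunk = <$in>) {
        chomp $chunk;                 # chomp removes the trailing "HDR" separator
        if ($first) {                 # whatever precedes the first header
            $first = 0;
            print {$out} $chunk;
            next;
        }
        substr($chunk, 0, 47, '');    # rest of the header: 50 minus the 3 of "HDR"
        my $pos = index($chunk, 'FTR');
        substr($chunk, $pos, 30, '') if $pos >= 0;   # drop the footer
        print {$out} $chunk;
    }
    return;
}
```

Anything between a footer and the next header, such as a newline, is kept as it is.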

Re: header footer
by Kenosis (Priest) on Mar 04, 2014 at 22:34 UTC

    Can you share a (redacted, if necessary) copy of one of those records? Are the headers and footers the same for each record? Also, what separates those records?

      The file will have 1000s of records, and each record has its own header and footer. The header and footer can be different for each record. The only constant is that the header and footer are of fixed length, 50 and 30.
      HDR.S287878877.DDDDD.DDDDDDXXXXXXXXXXXXXXXXXXXXXXX1STR HYTRES NAME PLACE DEST GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG GGGGGGGGGGGGGGGGGGGG1111111111111111111112222222222222222222222222222333333333333333333333333333444444444444444444444444444444444455 5555555555555555555555555555566666666666666777777777777FTRDDDDDDDDDDFFFFFFFFFFFFFFFFF
      HDR.S287878877.DDDDD.DDDDDDXXXXXXXXXXXXXXXXXXXXXXX1STR HYTRES NAME PLACE DEST GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG GGGGGGGGGGGGGGGGGGGG1111111111111111111112222222222222222222222222222333333333333333333333333333444444444444444444444444444444444455 5555555555555555555555555555566666666666666777777777777FTRDDDDDDDDDDFFFFFFFFFFFFFFFFF
      Sorry if there was a better way to send a sample file. I copied the same record twice for simplicity's sake. Real records will differ, but the header and footer lengths will always be 50 and 30, and every record will have them, right through to the end of the file.
        Just to clarify

        HDR.S287878877.DDDDD.DDDDDDXXXXXXXXXXXXXXXXXXXXXXX

        is the header and

        FTRDDDDDDDDDDFFFFFFFFFFFFFFFFF

        is the footer
