PerlMonks  

Splitting a file into records

by TStanley (Canon)
on May 03, 2013 at 15:56 UTC (#1031918=perlquestion)
TStanley has asked for the wisdom of the Perl Monks concerning the following question:

I have the following text file:
    STORE MONITORING REPORT as of 13-05-02 10:05:07
    Scanning for FTFIMS

     STORE002

    -rwxr-xr-- 1 admins store 59025 Apr 11 2012 eft100.cbr 16295 58
    -rwxr-xr-- 1 admins store 61143 Nov 15 15:47 chk075.cbr 33334 60
    -rwxr-xr-- 1 admins store 420952 Sep 6 2012 test-encrypt 63327 412
    -rwxr-xr-- 1 admins store 427068 Sep 6 2012 eft115-20 36184 418
    -rwxr-xr-- 1 admins store 460694 Apr 3 06:15 eft6un 07640 450
    -rwxrwxrwx 1 admins store 481069 Oct 4 2012 hostsocgw 46087 470
    -rwxrwxrwx 1 admins store 503666 Feb 13 09:10 stratgw 22452 492
    -rwxr-xr-- 1 admins store 14318 Nov 1 2010 unityrep 50196 14

     STORE006

    -rwxr-xr-- 1 admins store 59025 Apr 11 2012 eft100.cbr 16295 58
    -rwxr-xr-- 1 admins store 61143 Nov 15 15:47 chk075.cbr 33334 60
    -rwxr-xr-- 1 admins store 420952 Sep 6 2012 test-encrypt 63327 412
    -rwxr-xr-- 1 admins store 427068 Sep 6 2012 eft115-20 36184 418
    -rwxr-xr-- 1 admins store 460694 Apr 3 06:15 eft6un 07640 450
    -rwxrwxrwx 1 admins store 481069 Oct 4 2012 hostsocgw 46087 470
    -rwxrwxrwx 1 admins store 503666 Feb 13 09:10 stratgw 22452 492
    -rwxr-xr-- 1 admins store 14318 Nov 1 2010 unityrep 50196 14

What I would like to do is save the first two lines to a file and then break the rest down into records. Right now my code only tries to break the input into records, and it looks like this:

    my ($TEXTIN,$HTMLOUT);
    my $input2 = $outputfile;
    my $output2 = "FTFIMS.html";
    my @records=();
    my $inrecord=0;
    my $rxRecStart = qr{STORE\d+};
    my $rxRecStop = qr{\n\s};
    my $recordStr = q{};

    open $TEXTIN,"<",$input2 || die "Can not open $input2: $!\n";
    open $HTMLOUT,">",$output2 || die "Can not open $output2: $!\n";
    print $HTMLOUT "<html>\n<body>\n<pre>";
    while(<$TEXTIN>){
        if (m{$rxRecStart}){
            $inrecord = 1;
            push @records, $recordStr if $recordStr;
            $recordStr = $_;
        }elsif(m{$rxRecStop}){
            $inrecord = 0;
            push @records, $recordStr if $recordStr;
            $recordStr = q{};
        }else{
            $recordStr .= $_ if $inrecord;
        }
    }
    close $TEXTIN;
    print $HTMLOUT "$records[0]";
    print $HTMLOUT "$records[1]";
    print $HTMLOUT "\n</pre>\n</body>\n</html>";
    close $HTMLOUT;

Right now, the above code is only printing the contents of $records[0], which is the first one that starts with STORE002. $records[1] is empty, which means my regular expressions for the record start and record stop are incorrect.
As always, give me a pointer in the right direction. Thanks.

TStanley
--------
People sleep peaceably in their beds at night only because rough men stand ready to do violence on their behalf. -- George Orwell

Re: Splitting a file into records
by LanX (Canon) on May 03, 2013 at 16:11 UTC
    If I were you I would use the input record separator to split between headlines and records:

        $/="\n\n";
        print my $head=<DATA>;
        while ( my $store = <DATA>) {
            my $listing =<DATA>;
            chomp($store,$listing);
            print "\n\n<<<$store>>>\n$listing";
        }
        __DATA__
        STORE MONITORING REPORT as of 13-05-02 10:05:07
        Scanning for FTFIMS

         STORE002

        -rwxr-xr-- 1 admins store 59025 Apr 11 2012 eft100.cbr 16295 58
        -rwxr-xr-- 1 admins store 61143 Nov 15 15:47 chk075.cbr 33334 60
        yadda yadda ...

         STORE006

        -rwxr-xr-- 1 admins store 59025 Apr 11 2012 eft100.cbr 16295 58
        -rwxr-xr-- 1 admins store 61143 Nov 15 15:47 chk075.cbr 33334 60
        yadda yadda ...

    Another approach would be to use the flip-flop operator (see "Range Operators" in perlop).
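
    A minimal sketch of the flip-flop idea, not taken from the thread. It assumes the layout implied by the replies (two header lines, then store names and listings separated by blank lines). The helper name split_report is hypothetical. With constant operands, the range operator in scalar context compares each operand against the current input line number $., so 1 .. 2 is true for exactly the first two lines read.

```perl
use strict;
use warnings;

# split_report is a hypothetical helper, not part of the original thread.
# It takes the report text and returns (\@header, \%records).
sub split_report {
    my ($text) = @_;

    # Read the string through an in-memory filehandle so that $. is maintained.
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";

    my ( @header, %records, $store );
    while ( my $line = <$fh> ) {
        chomp $line;

        if ( 1 .. 2 ) {                  # flip-flop on $.: true for lines 1 and 2
            push @header, $line;
            next;
        }

        next unless $line =~ /\S/;       # skip blank lines

        if ( $line =~ /^\s*(STORE\d+)\s*$/ ) {
            $store = $1;                 # a new record starts here
            next;
        }

        # Everything else belongs to the store seen most recently.
        push @{ $records{$store} }, $line if defined $store;
    }
    close $fh;

    return ( \@header, \%records );
}
```

    Called as my ($header, $records) = split_report($text), this returns the two header lines as an array reference and the listing lines grouped by store name as a hash of arrays.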

    Cheers Rolf

    ( addicted to the Perl Programming Language)

Re: Splitting a file into records
by Kenosis (Priest) on May 03, 2013 at 16:23 UTC

    I second LanX's suggestion (++) of using Perl's record separator ($/), but would initialize a local copy of it to 'STORE':

        use strict;
        use warnings;

        local $/ = 'STORE';
        my $count = 1;

        while (<DATA>) {
            chomp;

            if (/MONITORING/) {
                s/^\s+//;
                s/\s+$//;
                print "First two lines:\n$/ $_\n\n";
            }
            elsif (/\S/) {
                print "\n\nRecord " . $count++ . ":\n$/ $_";
            }
        }

        __DATA__
        STORE MONITORING REPORT as of 13-05-02 10:05:07
        Scanning for FTFIMS

         STORE002

        -rwxr-xr-- 1 admins store 59025 Apr 11 2012 eft100.cbr 16295 58
        -rwxr-xr-- 1 admins store 61143 Nov 15 15:47 chk075.cbr 33334 60
        -rwxr-xr-- 1 admins store 420952 Sep 6 2012 test-encrypt 63327 412
        -rwxr-xr-- 1 admins store 427068 Sep 6 2012 eft115-20 36184 418
        -rwxr-xr-- 1 admins store 460694 Apr 3 06:15 eft6un 07640 450
        -rwxrwxrwx 1 admins store 481069 Oct 4 2012 hostsocgw 46087 470
        -rwxrwxrwx 1 admins store 503666 Feb 13 09:10 stratgw 22452 492
        -rwxr-xr-- 1 admins store 14318 Nov 1 2010 unityrep 50196 14

         STORE006

        -rwxr-xr-- 1 admins store 59025 Apr 11 2012 eft100.cbr 16295 58
        -rwxr-xr-- 1 admins store 61143 Nov 15 15:47 chk075.cbr 33334 60
        -rwxr-xr-- 1 admins store 420952 Sep 6 2012 test-encrypt 63327 412
        -rwxr-xr-- 1 admins store 427068 Sep 6 2012 eft115-20 36184 418
        -rwxr-xr-- 1 admins store 460694 Apr 3 06:15 eft6un 07640 450
        -rwxrwxrwx 1 admins store 481069 Oct 4 2012 hostsocgw 46087 470
        -rwxrwxrwx 1 admins store 503666 Feb 13 09:10 stratgw 22452 492
        -rwxr-xr-- 1 admins store 14318 Nov 1 2010 unityrep 50196 14

    Output:

        First two lines:
        STORE MONITORING REPORT as of 13-05-02 10:05:07
        Scanning for FTFIMS

        Record 1:
        STORE 002

        -rwxr-xr-- 1 admins store 59025 Apr 11 2012 eft100.cbr 16295 58
        -rwxr-xr-- 1 admins store 61143 Nov 15 15:47 chk075.cbr 33334 60
        -rwxr-xr-- 1 admins store 420952 Sep 6 2012 test-encrypt 63327 412
        -rwxr-xr-- 1 admins store 427068 Sep 6 2012 eft115-20 36184 418
        -rwxr-xr-- 1 admins store 460694 Apr 3 06:15 eft6un 07640 450
        -rwxrwxrwx 1 admins store 481069 Oct 4 2012 hostsocgw 46087 470
        -rwxrwxrwx 1 admins store 503666 Feb 13 09:10 stratgw 22452 492
        -rwxr-xr-- 1 admins store 14318 Nov 1 2010 unityrep 50196 14

        Record 2:
        STORE 006

        -rwxr-xr-- 1 admins store 59025 Apr 11 2012 eft100.cbr 16295 58
        -rwxr-xr-- 1 admins store 61143 Nov 15 15:47 chk075.cbr 33334 60
        -rwxr-xr-- 1 admins store 420952 Sep 6 2012 test-encrypt 63327 412
        -rwxr-xr-- 1 admins store 427068 Sep 6 2012 eft115-20 36184 418
        -rwxr-xr-- 1 admins store 460694 Apr 3 06:15 eft6un 07640 450
        -rwxrwxrwx 1 admins store 481069 Oct 4 2012 hostsocgw 46087 470
        -rwxrwxrwx 1 admins store 503666 Feb 13 09:10 stratgw 22452 492
        -rwxr-xr-- 1 admins store 14318 Nov 1 2010 unityrep 50196 14

    Hope this helps!

Re: Splitting a file into records
by davido (Archbishop) on May 03, 2013 at 16:26 UTC

    ...or line by line, but build up records based on STORExxx:

        use strict;
        use warnings;
        use Data::Dumper;

        my @header = map { chomp( my $line = <DATA> ); $line } 0 .. 1; # Grab 1st 2

        my $store = '';
        my %records;

        while ( my $line = <DATA> ) {
            chomp $line;
            next unless length $line;
            if( $line =~ m/\s(STORE\d{3})/ ) {
                $store = $1;
                next;
            }
            $records{ $store } .= $line . "\n";
        }

        print Dumper \@header;
        print Dumper \%records;

        __DATA__
        STORE MONITORING REPORT as of 13-05-02 10:05:07
        Scanning for FTFIMS

         STORE002

        -rwxr-xr-- 1 admins store 59025 Apr 11 2012 eft100.cbr 16295 58
        -rwxr-xr-- 1 admins store 61143 Nov 15 15:47 chk075.cbr 33334 60
        -rwxr-xr-- 1 admins store 420952 Sep 6 2012 test-encrypt 63327 412
        -rwxr-xr-- 1 admins store 427068 Sep 6 2012 eft115-20 36184 418
        -rwxr-xr-- 1 admins store 460694 Apr 3 06:15 eft6un 07640 450
        -rwxrwxrwx 1 admins store 481069 Oct 4 2012 hostsocgw 46087 470
        -rwxrwxrwx 1 admins store 503666 Feb 13 09:10 stratgw 22452 492
        -rwxr-xr-- 1 admins store 14318 Nov 1 2010 unityrep 50196 14

         STORE006

        -rwxr-xr-- 1 admins store 59025 Apr 11 2012 eft100.cbr 16295 58
        -rwxr-xr-- 1 admins store 61143 Nov 15 15:47 chk075.cbr 33334 60
        -rwxr-xr-- 1 admins store 420952 Sep 6 2012 test-encrypt 63327 412
        -rwxr-xr-- 1 admins store 427068 Sep 6 2012 eft115-20 36184 418
        -rwxr-xr-- 1 admins store 460694 Apr 3 06:15 eft6un 07640 450
        -rwxrwxrwx 1 admins store 481069 Oct 4 2012 hostsocgw 46087 470
        -rwxrwxrwx 1 admins store 503666 Feb 13 09:10 stratgw 22452 492
        -rwxr-xr-- 1 admins store 14318 Nov 1 2010 unityrep 50196 14

    Dave

Re: Splitting a file into records
by hdb (Parson) on May 03, 2013 at 18:01 UTC

    A small variation on davido's proposal. Instead of storing the records in a hash I am using an array of hashes. Every time I encounter a new "STORE" line, I push a new hash reference onto the array. This way I do not need to store the store in a variable, as index -1 always refers to the last element. Just for completeness I put the header lines into the same structure.

    In order to make things look nicer, I also assumed that the lines of your data are not split but come from something like ls -l. Should this assumption be incorrect one has to change the third regex to m/^[dlrwx-]{10}|^\d+/. One could also merge the first and third regex into one as the action is identical but the code borders on obfuscation already...

        use strict;
        use warnings;
        use Data::Dumper;

        my @records = ( { store => 'header' } );

        while ( my $line = <DATA> ) {
            chomp $line;
            push @{$records[-1]{lines}}, $line if $line =~ m/^STORE|^Scan/;
            push @records, { store => $1 } if $line =~ m/^\s(STORE\d{3})/;
            push @{$records[-1]{lines}}, $line if $line =~ m/^[-dlrwx]{10}/;
        }

        print Dumper \@records;

        __DATA__
        STORE MONITORING REPORT as of 13-05-02 10:05:07
        Scanning for FTFIMS

         STORE002

        -rwxr-xr-- 1 admins store 59025 Apr 11 2012 eft100.cbr 16295 58
        -rwxr-xr-- 1 admins store 61143 Nov 15 15:47 chk075.cbr 33334 60
        -rwxr-xr-- 1 admins store 420952 Sep 6 2012 test-encrypt 63327 412
        -rwxr-xr-- 1 admins store 427068 Sep 6 2012 eft115-20 36184 418
        -rwxr-xr-- 1 admins store 460694 Apr 3 06:15 eft6un 07640 450
        -rwxrwxrwx 1 admins store 481069 Oct 4 2012 hostsocgw 46087 470
        -rwxrwxrwx 1 admins store 503666 Feb 13 09:10 stratgw 22452 492
        -rwxr-xr-- 1 admins store 14318 Nov 1 2010 unityrep 50196 14

         STORE006

        -rwxr-xr-- 1 admins store 59025 Apr 11 2012 eft100.cbr 16295 58
        -rwxr-xr-- 1 admins store 61143 Nov 15 15:47 chk075.cbr 33334 60
        -rwxr-xr-- 1 admins store 420952 Sep 6 2012 test-encrypt 63327 412
        -rwxr-xr-- 1 admins store 427068 Sep 6 2012 eft115-20 36184 418
        -rwxr-xr-- 1 admins store 460694 Apr 3 06:15 eft6un 07640 450
        -rwxrwxrwx 1 admins store 481069 Oct 4 2012 hostsocgw 46087 470
        -rwxrwxrwx 1 admins store 503666 Feb 13 09:10 stratgw 22452 492
        -rwxr-xr-- 1 admins store 14318 Nov 1 2010 unityrep 50196 14

Re: Splitting a file into records
by Laurent_R (Vicar) on May 03, 2013 at 21:05 UTC

    Hi,

    you have been given some good solutions, I won't give you any other. I would just like to make some comments on your code.

        my ($TEXTIN,$HTMLOUT);
        my $input2 = $outputfile;
        my $output2 = "FTFIMS.html";
        my @records=();
        my $inrecord=0;
        my $rxRecStart = qr{STORE\d+};
        my $rxRecStop = qr{\n\s};
        my $recordStr = q{};

    It is not a very good idea to declare all your lexical variables at the top of your program, because you are essentially making them global to the whole file and this negates a large part of the advantages of lexical variables. Try to limit scope of variables to the enclosing block where they belong.

    open $TEXTIN,"<",$input2 || die "Can not open $input2: $!\n";

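    This will not die if the program fails to open the file (say, if the file does not exist) because of a precedence problem: || binds more tightly than the open list operator, so the statement is parsed as open $TEXTIN, "<", ($input2 || die ...), and die runs only if $input2 itself is false.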

    You should either have parens:

    open ($TEXTIN, "<", $input2) || die "Can not open $input2: $!\n";

    or use the lower precedence operator or:

    open $TEXTIN, "<", $input2 or die "Can not open $input2: $!\n";

    But it would be even better to declare your filehandle within that statement:

    open my $TEXTIN, "<", $input2 or die "Can not open $input2: $!\n";

    Same thing of course for the other file opening statement.
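
    The difference can be demonstrated with a short sketch. The path below is hypothetical and is assumed not to exist on the machine running the code; the variable names are illustrative only.

```perl
use strict;
use warnings;

# Hypothetical path, assumed not to exist.
my $missing = '/no/such/dir/no-such-file.txt';

# Parsed as: open(my $fh1, '<', ($missing || die ...)).
# $missing is a true value, so die is never reached even though open fails.
my $ok_high = open my $fh1, '<', $missing || die "unreachable\n";

# 'or' has lower precedence than the list operator, so it tests open's result.
my $ok_low = eval {
    open my $fh2, '<', $missing or die "Cannot open $missing: $!\n";
    1;
};
my $error = $@;

print "with ||: open returned ", ( $ok_high ? "true" : "false" ), ", no exception\n";
print "with or: died with: $error";
```

    With || the failure passes silently and the problem only shows up later, when reading from the unopened handle; with or (or with parentheses around open's arguments) the script stops at the point of failure.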

    Otherwise, I think that the algorithm within your while loop is too complicated, error prone and not very robust. In particular, the $rxRecStop regexp is very weak and might match where you don't expect. Also, it will probably not match anything at the end of the file, so that the last section will not be recorded.

    Rather than having a beginning and end regexp, I think it would probably be better to have only one break regexp (the /STORE\d{3}/ is a good candidate). When you meet it, you do the house cleaning of the previous section (storing data) and the preparation of the next section (reinitializing the variables). Something like this:

        my $header = "";
        $header .= <$TEXTIN> for 0..1; # record the first two lines for later use
        while(<$TEXTIN>){
            next if m/^\s*$/; # get rid of empty lines
            if (m{$rxRecStart}){
                # do what you need to finish off the previous section
                # (saving the data) and start the new one
            } else {
                $recordStr .= $_;
            }
        }
        # add code here for storing the last section
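
    One way the skeleton above could be filled in is sketched here. The helper name parse_report and its return shape are assumptions made for illustration, not code from the thread; the point it shows is the single break regexp plus the final push after the loop, which is what keeps the last section from being lost.

```perl
use strict;
use warnings;

# parse_report is a hypothetical helper. It reads from an already opened
# filehandle and returns the header text plus a reference to a list of
# record strings, one per STOREnnn section.
sub parse_report {
    my ($fh) = @_;
    my $rx_rec_start = qr{STORE\d{3}};

    my $header = '';
    for ( 1 .. 2 ) {                       # keep the first two lines
        my $line = <$fh>;
        last unless defined $line;         # tolerate very short input
        $header .= $line;
    }

    my @records;
    my $record = '';
    while ( my $line = <$fh> ) {
        next if $line =~ m/^\s*$/;         # get rid of empty lines
        if ( $line =~ $rx_rec_start ) {
            push @records, $record if length $record;   # finish the previous section
            $record = $line;                            # start the new one
        }
        else {
            # Lines before the first STORE line are ignored.
            $record .= $line if length $record;
        }
    }
    push @records, $record if length $record;           # store the last section

    return ( $header, \@records );
}
```

    After open my $TEXTIN, "<", $input2 or die ..., a call such as my ($header, $records) = parse_report($TEXTIN) would give the header for the separate file and one string per store for the HTML output.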

    I just wanted to propose some improvements on the basis of your code, to help you think about it. I think that the solution with the modification of the input record separator proposed by Rolf and others is probably better.

    Edit: changed the way $header is assigned to prevent an "uninitialized value" warning.
