
Re: Splitting a file into records

by Laurent_R (Canon)
on May 03, 2013 at 21:05 UTC (#1031959)

in reply to Splitting a file into records


You have been given some good solutions, so I won't offer another one. I would just like to make some comments on your code.

my ($TEXTIN,$HTMLOUT);
my $input2 = $outputfile;
my $output2 = "FTFIMS.html";
my @records=();
my $inrecord=0;
my $rxRecStart = qr{STORE\d+};
my $rxRecStop = qr{\n\s};
my $recordStr = q{};

It is not a very good idea to declare all your lexical variables at the top of your program, because you are essentially making them global to the whole file, which negates a large part of the advantages of lexical variables. Try to limit the scope of each variable to the enclosing block where it belongs.
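To illustrate the point about scope, here is a minimal sketch. The sample lines and the names $id and $item are made up for the example and are not taken from the original program.

```perl
use strict;
use warnings;

# @records is needed after the loop, so it is declared outside it.
my @records;

# Hypothetical sample lines, used only for illustration.
for my $line ("STORE001 apples", "STORE002 pears") {
    # $id and $item exist only for one iteration of this loop.
    my ($id, $item) = split ' ', $line, 2;
    push @records, "$id => $item";
}

# $id and $item are out of scope here; referring to them would be
# a compile-time error under "use strict".
print "$_\n" for @records;
```

Only the variable that has to outlive the loop is declared at file level; everything else disappears as soon as its block ends, so it cannot be modified by accident somewhere else.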

open $TEXTIN,"<",$input2 || die "Can not open $input2: $!\n";

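This will not die if the program fails to open the file (say, if the file does not exist), because of a precedence problem: || binds more tightly than the comma-separated argument list of open, so the statement is parsed as open $TEXTIN, "<", ($input2 || die ...). Since $input2 is a true value, the die is never reached.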

You should either have parens:

open ($TEXTIN, "<", $input2) || die "Can not open $input2: $!\n";

or use the lower precedence operator or:

open $TEXTIN, "<", $input2 or die "Can not open $input2: $!\n";

But it would be even better to declare your filehandle within that statement:

open my $TEXTIN, "<", $input2 or die "Can not open $input2: $!\n";

Same thing of course for the other file opening statement.
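The difference between the two operators can be checked with a small experiment. This is only a sketch: the path below is a hypothetical one that is assumed not to exist on your machine, and eval is used just to catch the die so that both variants can run in one script.

```perl
use strict;
use warnings;

# Hypothetical path, assumed not to exist.
my $missing = "/no/such/dir/no_such_file.txt";

# With ||, the die is attached to the (true) filename string, so it
# never runs and the failure of open goes unnoticed.
my $died_with_pipes = 0;
eval {
    open my $fh, "<", $missing || die "Can not open $missing: $!\n";
    1;
} or $died_with_pipes = 1;

# With "or", the die is attached to the result of open itself.
my $died_with_or = 0;
eval {
    open my $fh, "<", $missing or die "Can not open $missing: $!\n";
    1;
} or $died_with_or = 1;

print "||: ", ($died_with_pipes ? "died" : "carried on silently"), "\n";
print "or: ", ($died_with_or    ? "died" : "carried on silently"), "\n";
```

If the path really is absent, only the "or" variant should report that it died.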

Otherwise, I think that the algorithm within your while loop is too complicated, error-prone and not very robust. In particular, the $rxRecStop regexp is very weak and might match where you don't expect it to. It will also probably not match anything at the end of the file, so the last section will not be recorded.

Rather than having a beginning and an end regexp, I think it would probably be better to have only one break regexp (the /STORE\d{3}/ is a good candidate). When you meet it, you do the housekeeping for the previous section (storing the data) and prepare the next section (reinitializing the variables). Something like this:

my $header = "";
$header .= <$TEXTIN> for 0..1;    # record the first two lines for later use
while (<$TEXTIN>) {
    next if m/^\s*$/;             # get rid of empty lines
    if (m{$rxRecStart}) {
        # do what you need to finish off the previous section
        # (saving the data) and start the new one
    } else {
        $recordStr .= $_;
    }
}
# add code here for storing the last section
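For reference, here is one way the skeleton above might be filled in. It is a sketch under assumptions: the sample data is invented, it is read from an in-memory filehandle rather than from the real file, and each record is simply kept as one string in @records.

```perl
use strict;
use warnings;

# Made-up sample input standing in for the real file (an assumption).
my $sample = <<'END';
Report title
Column headings
STORE001
apples 3

pears 5
STORE002
plums 7
END

open my $in, "<", \$sample or die "Can not open in-memory sample: $!\n";

my $rxRecStart = qr{STORE\d+};
my $header = "";
$header .= <$in> for 0..1;          # keep the first two lines for later use

my @records;
my $recordStr = "";
while (my $line = <$in>) {
    next if $line =~ m/^\s*$/;      # skip empty lines
    if ($line =~ $rxRecStart) {
        # close the previous record, if any, then start the new one
        push @records, $recordStr if length $recordStr;
        $recordStr = $line;
    } else {
        $recordStr .= $line;
    }
}
# the last record has no following STORE line, so store it here
push @records, $recordStr if length $recordStr;
close $in;

printf "%d records\n", scalar @records;
print "---\n$_" for @records;
```

The final push after the loop is the step that the start/stop version was likely to miss, since nothing in the input marks the end of the last section.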

I just wanted to propose some improvements on the basis of your code, to help you think about it. I think that the solution with the modification of the input record separator proposed by Rolf and others is probably better.
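The other replies are not reproduced here, so the following is only a guess at what an input record separator approach could look like. It assumes that the literal string "STORE" appears nowhere except at the start of a record, and it uses invented sample data read from an in-memory filehandle.

```perl
use strict;
use warnings;

# Made-up sample input (an assumption, not the real data).
my $sample = "Report title\nColumn headings\n"
           . "STORE001\napples 3\n"
           . "STORE002\nplums 7\n";

open my $in, "<", \$sample or die "Can not open in-memory sample: $!\n";

my @chunks;
{
    local $/ = "STORE";             # read up to each "STORE" instead of each newline
    while (my $chunk = <$in>) {
        chomp $chunk;               # chomp removes a trailing $/, here "STORE"
        push @chunks, $chunk;
    }
}
close $in;

my $header  = shift @chunks;                # everything before the first STORE
my @records = map { "STORE$_" } @chunks;    # put the marker back on each record

print "---\n$_" for @records;
```

Localizing $/ inside a bare block keeps the change from leaking into any other file reading done later in the program.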

Edit: changed the way $header is assigned to prevent an "uninitialized value" warning.
