Problem with a regex?

TStanley has asked for the wisdom of the Perl Monks concerning the following question:

I have a large report that I need to extract data from. The report can be broken down into records, with the start of each one looking similar to what is below:

REPORT HEADER ISCDAYRECAP-001 ISC001
 ISC RECAP REPORT FOR STORE: 001                                      
+  PAGE: 00
1
                                                       XTNDED MRKDWN F
+OR STORE:
     12.00            R U N
DEPT:    GROCERY                                         POST DATE: 07
+/14/2011
                     DATE/TIME: 07/14/2011  21:11:05

                                                               EXTEND 
+ MRKDWN
REASON                  EXT. MRKDWN

I am using the following code to split the file out into the separate stores, but it splits out into two elements, with element 0 of the array being empty, and everything else within element 1:

#!/usr/bin/perl -w
use strict;

open my $IN,"<","QISC001" or die "Can not open QISC001: $!\n";

my @records;

my $data = do{
  local $/;
  <$IN>;
};


@records = split m|(?<=\n)(?=REPORT HEADER ISCDAYRECAP-\d{3})|, $data;

close $IN;

One thing I noticed when viewing the input file in vi (I am doing this on an HP-UX system) is that there is a ^L character at the start of each store, with the exception of the first one, so my guess is that the first part of the regex is incorrect. As always, suggestions/hints are welcome.

TStanley
--------
People sleep peaceably in their beds at night only because rough men stand ready to do violence on their behalf. -- George Orwell
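
A small experiment can check the guess about the ^L character. The sketch below uses a made-up two-page string (not the real QISC001 file) in which the second header follows a form feed rather than a newline, and compares the original lookbehind with a variant that also accepts `\f`. The sample text and variable names are assumptions for illustration only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical miniature report: the second header follows a form feed (^L),
# not a newline, mimicking what vi showed in the real file.
my $data = "REPORT HEADER ISCDAYRECAP-001 ISC001\nstore 001 body\n"
         . "\fREPORT HEADER ISCDAYRECAP-002 ISC002\nstore 002 body\n";

# Original pattern: requires a newline immediately before the header.
my @by_newline = split m|(?<=\n)(?=REPORT HEADER ISCDAYRECAP-\d{3})|, $data;

# Variant: accept either a newline or a form feed before the header.
my @by_either = split m|(?<=[\n\f])(?=REPORT HEADER ISCDAYRECAP-\d{3})|, $data;

printf "newline only:  %d record(s)\n", scalar @by_newline;
printf "newline or FF: %d record(s)\n", scalar @by_either;
```

With this sample data the first pattern never matches, because the only character directly in front of the second header is `\f`. Note that with the variant, each record except the last still ends with the form feed that preceded the next header.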

Re: Problem with a regex? by Jim (Curate) on Jul 15, 2011 at 17:11 UTC
The ^L is the FORM FEED control character. It's used to separate pages ("records") of the report. You can probably `split` on the FORM FEED character rather than on the text of the report header.

Better yet, don't slurp the entire "large report" into memory, but instead process each report page one at a time by setting `$/` (`$INPUT_RECORD_SEPARATOR`) to the FORM FEED character `"\f"`.

#!/usr/bin/perl

use strict;
use warnings;
use autodie qw( open close );
use English qw( -no_match_vars );

# Report pages are separated by FORM FEED control characters
local $INPUT_RECORD_SEPARATOR = "\f";

open my $report, '<', 'QISC001';

while (my $page = <$report>) {
    # Parse and transform each report page here...
}

close $report;

exit 0;

Jim

UPDATE: You mentioned you're splitting the report into separate "stores." I presume this means you're carving the report into individual files, one per page. This script is untested, but it illustrates some general ideas you might find useful.

#!/usr/bin/perl

use strict;
use warnings;
use autodie qw( open close );
use English qw( -no_match_vars );

@ARGV == 1 or die "Usage: perl $PROGRAM_NAME <report file>\n";

# Report pages are separated by FORM FEED control characters
local $INPUT_RECORD_SEPARATOR = "\f";

my $report_file = shift @ARGV;

open my $report_fh, '<', $report_file;

while (my $page = <$report_fh>) {
    my ($page_number, $store_number, $post_date) = $page =~ m{
        PAGE:\s+(\d+)
        .+?
        STORE:\s+(\d+)
        .+?
        POST\s+DATE:\s+(\d\d/\d\d/\d\d\d\d)
    }msx;

    # For example, 07/14/2011 => 20110714
    $post_date =~ s{(\d\d)/(\d\d)/(\d\d\d\d)}{$3$1$2};

    # For example, 20110714-001-001.rpt
    my $page_file = sprintf "%s-%03d-%03d.rpt",
        $post_date, $store_number, $page_number;

    open my $page_fh, '>', $page_file;
    print {$page_fh} $page;
    close $page_fh;
}

close $report_fh;

exit 0;
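
The two approaches mentioned in this reply (splitting slurped data on the form feed, and reading page by page with `$/` set to `"\f"`) can be compared on a small in-memory string. This is only a sketch: the three-page sample data and the variable names are invented, and an in-memory filehandle stands in for the real report file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical three-page report held in memory; pages are separated by "\f".
my $data = "page one\n\fpage two\n\fpage three\n";

# Approach 1: slurp, then split on the FORM FEED character.
my @pages = split /\f/, $data;

# Approach 2: read one page at a time from an in-memory filehandle.
my @streamed;
{
    local $/ = "\f";    # $/ is the same variable as $INPUT_RECORD_SEPARATOR
    open my $fh, '<', \$data or die "Cannot open in-memory file: $!\n";
    while (my $page = <$fh>) {
        chomp $page;    # chomp strips the trailing "\f" because $/ is "\f"
        push @streamed, $page;
    }
    close $fh;
}

print scalar(@pages), " pages via split, ", scalar(@streamed), " via \$/\n";
```

Without the `chomp`, every page except the last would keep its trailing form feed, which may or may not be wanted when the pages are written back out to separate files.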
Re^2: Problem with a regex? by TStanley (Canon) on Jul 15, 2011 at 18:22 UTC
This did the trick. Thanks for your help.

TStanley
Re: Problem with a regex? by Anonymous Monk on Jul 15, 2011 at 17:21 UTC
It looks awfully fixed width ;)

sub TenLinesToHash {
    my( $fh ) = @_;
    my %hash;
    my $line = <$fh>;
    chomp($line);
    $hash{header} = $line;
    $line = <$fh>;
    chomp($line);
    @hash{qw/ header2 ix1 page1 /} = unpack 'A30 A3 x33 x6 A3', $line;
    ...
    return \%hash;
}
Re^2: Problem with a regex? by Jim (Curate) on Jul 15, 2011 at 17:52 UTC
In my experience parsing and transforming printer files ("report scraping"), I've used regular expression pattern matching more often than `substr` or `unpack`. Why? Because there's no guarantee the report data will be consistently aligned in column positions. As it happens, items tend to drift left and right a bit, especially over the lifetime of a report that changes occasionally. Maybe the date was in column positions 33 through 42 for a few years, then somebody modified the report; thereafter, the date was in column positions 23 through 32.

Obviously, there could be other variation over time besides the shifting left or right of report items, but this is precisely why, in general, I've found it better to start with regular expression pattern matching right out of the chute. It's more adaptable in the face of variation.

I've also found it better (more understandable, more maintainable, etc.) to parse the report into pages or records first, and then to scrape the data from each page or record in a separate step, typically using a function that returns a list or hash of the parsed data.

Jim
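
The column-drift argument can be illustrated with a short sketch. Everything here is hypothetical: the two sample lines, the column offsets, and the function names `date_by_column` and `scrape_line` are invented for the comparison and are not taken from the real report.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Two hypothetical versions of the same report line. In the second one the
# POST DATE item has drifted three columns to the left.
my $old_line = 'DEPT: GROCERY' . (' ' x 10) . 'POST DATE: 07/14/2011';
my $new_line = 'DEPT: GROCERY' . (' ' x 7)  . 'POST DATE: 07/14/2011';

# Fixed-width extraction: skip 34 columns, then take up to 10 characters.
# The offset 34 is correct only for the old layout.
sub date_by_column {
    my ($line) = @_;
    my ($date) = unpack 'x34 A10', $line;
    return $date;
}

# Pattern-based extraction: locate each label wherever it sits on the line
# and return the captured values as a hash reference.
sub scrape_line {
    my ($line) = @_;
    my %field;
    @field{qw( dept post_date )} =
        $line =~ m{DEPT:\s+(\S+).*?POST\s+DATE:\s+(\d\d/\d\d/\d{4})};
    return \%field;
}

for my $line ($old_line, $new_line) {
    my $record = scrape_line($line);
    printf "by column: %-10s  by pattern: %s\n",
        date_by_column($line) // '', $record->{post_date} // '';
}
```

On the old layout both functions return the full date. On the drifted layout the fixed-width version returns a truncated fragment, while the pattern-based version still finds the date, which is the adaptability described above.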
