PerlMonks  

Problem with a regex?

by TStanley (Canon)
on Jul 15, 2011 at 16:34 UTC ( #914645=perlquestion )
TStanley has asked for the wisdom of the Perl Monks concerning the following question:

I have a large report that I need to extract data from. The report can be broken down into records, with the start of each one looking similar to what is below:

REPORT HEADER ISCDAYRECAP-001 ISC001 ISC RECAP REPORT FOR STORE: 001 PAGE: 00 1 XTNDED MRKDWN FOR STORE: 12.00 R U N DEPT: GROCERY POST DATE: 07/14/2011 DATE/TIME: 07/14/2011 21:11:05 EXTEND MRKDWN REASON EXT. MRKDWN

I am using the following code to split the file out into the separate stores, but it splits out into two elements, with element 0 of the array being empty, and everything else within element 1:

#!/usr/bin/perl -w
use strict;

open my $IN, "<", "QISC001" or die "Can not open QISC001: $!\n";
my @records;
my $data = do { local $/; <$IN>; };
@records = split m|(?<=\n)(?=REPORT HEADER ISCDAYRECAP-\d{3})|, $data;
close $IN;

One thing that I noticed is that when viewing the input file in vi (I am doing this on an HP-UX system), there is a ^L character at the start of each store, with the exception of the first one, so my guess is that the first part of the regex is incorrect. As always, suggestions/hints are welcome.


TStanley
--------
People sleep peaceably in their beds at night only because rough men stand ready to do violence on their behalf. -- George Orwell

Re: Problem with a regex?
by Jim (Curate) on Jul 15, 2011 at 17:11 UTC

    The ^L is the FORM FEED control character. It's used to separate pages ("records") of the report.
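
    A small demonstration of why the original split finds nothing to split on. It assumes each FORM FEED sits directly in front of the next REPORT HEADER line, as the vi display suggests; the two-page sample data is made up.

    ```perl
    use strict;
    use warnings;

    # Hypothetical two-page sample; "\f" directly precedes the second header.
    my $data = "REPORT HEADER ISCDAYRECAP-001\npage one\n"
             . "\fREPORT HEADER ISCDAYRECAP-002\npage two\n";

    # Original pattern: the lookbehind demands "\n" right before the header,
    # but the character there is "\f", so the pattern never matches.
    my @original = split m|(?<=\n)(?=REPORT HEADER ISCDAYRECAP-\d{3})|, $data;

    # Accepting either "\n" or "\f" before the header lets the split succeed.
    my @relaxed = split m|(?<=[\n\f])(?=REPORT HEADER ISCDAYRECAP-\d{3})|, $data;

    printf "original: %d record(s), relaxed: %d record(s)\n",
        scalar @original, scalar @relaxed;
    ```

    Note that the relaxed pattern leaves the FORM FEED attached to the end of the preceding record.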

    You can probably split on the FORM FEED character rather than on the text of the report header. Better yet, don't slurp the entire "large report" into memory, but instead process each report page one at a time by setting $/ ($INPUT_RECORD_SEPARATOR) to the FORM FEED character "\f".

    #!/usr/bin/perl
    use strict;
    use warnings;
    use autodie qw( open close );
    use English qw( -no_match_vars );

    # Report pages are separated by FORM FEED control characters
    local $INPUT_RECORD_SEPARATOR = "\f";

    open my $report, '<', 'QISC001';

    while (my $page = <$report>) {
        # Parse and transform each report page here...
    }

    close $report;
    exit 0;
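
    For comparison, the first suggestion (slurp the file, then split on FORM FEED) might look like the sketch below. It reads from an in-memory string with made-up contents instead of QISC001 so that it runs on its own. For a genuinely large report, the record-separator loop above avoids holding everything in memory.

    ```perl
    use strict;
    use warnings;

    # Hypothetical stand-in for the report file: two pages separated by "\f".
    my $sample = "REPORT HEADER ISCDAYRECAP-001\npage one\n"
               . "\fREPORT HEADER ISCDAYRECAP-002\npage two\n";

    # Open the string as a file handle so the code mirrors reading a real file.
    open my $in, '<', \$sample or die "Can not open sample: $!\n";
    my $data = do { local $/; <$in> };
    close $in;

    # Split on the FORM FEED itself; the separator is dropped from each record.
    my @records = split /\f/, $data;

    print scalar(@records), " records\n";
    print "first line: ", (split /\n/, $_)[0], "\n" for @records;
    ```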

    Jim

    UPDATE: You mentioned you're splitting the report into separate "stores." I presume this means you're carving the report into individual files, one per page. This script is untested, but it illustrates some general ideas you might find useful.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use autodie qw( open close );
    use English qw( -no_match_vars );

    @ARGV == 1 or die "Usage: perl $PROGRAM_NAME <report file>\n";

    # Report pages are separated by FORM FEED control characters
    local $INPUT_RECORD_SEPARATOR = "\f";

    my $report_file = shift @ARGV;

    open my $report_fh, '<', $report_file;

    while (my $page = <$report_fh>) {
        my ($page_number, $store_number, $post_date)
            = $page =~ m{
                  PAGE:\s+(\d+)
                  .+?
                  STORE:\s+(\d+)
                  .+?
                  POST\s+DATE:\s+(\d\d/\d\d/\d\d\d\d)
              }msx;

        # For example, 07/14/2011 => 20110714
        $post_date =~ s{(\d\d)/(\d\d)/(\d\d\d\d)}{$3$1$2};

        # For example, 20110714-001-001.rpt
        my $page_file = sprintf "%s-%03d-%03d.rpt",
            $post_date, $store_number, $page_number;

        open my $page_fh, '>', $page_file;
        print {$page_fh} $page;
        close $page_fh;
    }

    close $report_fh;
    exit 0;

      This did the trick. Thanks for your help.

      TStanley
      --------
      People sleep peaceably in their beds at night only because rough men stand ready to do violence on their behalf. -- George Orwell
Re: Problem with a regex?
by Anonymous Monk on Jul 15, 2011 at 17:21 UTC
    It looks awfully fixed width ;)
    sub TenLinesToHash {
        my( $fh ) = @_;
        my %hash;

        my $line = <$fh>;
        chomp($line);
        $hash{header} = $line;

        $line = <$fh>;
        chomp($line);
        @hash{qw/ header2 ix1 page1 /} = unpack 'A30 A3 x33 x6 A3', $line;

        ...

        return \%hash;
    }

      In my experience parsing and transforming printer files ("report scraping"), I've used regular expression pattern matching more often than substr or unpack. Why? Because there's no guarantee the report data will be consistently aligned in column positions. As it happens, items tend to drift left and right a bit, especially over the lifetime of a report that changes occasionally. Maybe the date was in column positions 33 through 42 for a few years, then somebody modified the report; thereafter, the date was in column positions 23 through 32. Obviously, there could be other variation over time besides the shifting left or right of report items, but this is precisely why, in general, I've found it better to start with regular expression pattern matching right out of the chute. It's more adaptable in the face of variation.
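
      The difference can be seen in a small sketch. The two header lines and the unpack template below are invented for illustration (they are not taken from the real report); in the second line the date has drifted three columns to the left.

      ```perl
      use strict;
      use warnings;

      # Two hypothetical versions of the same header line; in the second one
      # the POST DATE field sits three columns further left.
      my $old_layout = 'DEPT: GROCERY' . (' ' x 20) . 'POST DATE: 07/14/2011';
      my $new_layout = 'DEPT: GROCERY' . (' ' x 17) . 'POST DATE: 07/14/2011';

      my @results;
      for my $line ($old_layout, $new_layout) {
          # Fixed-width: skip 44 columns, take 10. Tuned to the old layout only.
          my ($by_column) = unpack 'x44 A10', $line;

          # Pattern match: find the label wherever it is, then capture the date.
          my ($by_regex) = $line =~ m{POST\s+DATE:\s+(\d\d/\d\d/\d\d\d\d)};

          push @results, [ $by_column, $by_regex ];
          print "unpack: '$by_column'  regex: '$by_regex'\n";
      }
      ```

      With the shifted line, the fixed-width template returns a truncated date while the pattern match still finds the whole field.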

      I've also found it better (more understandable, more maintainable, etc.) to parse the report into pages or records first, and then to scrape the data from each page or record in a separate step, typically using a function that returns a list or hash of the parsed data.
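
      A minimal sketch of that two-step structure, using made-up page text and a hypothetical parse_page function. The field names are illustrative, and the patterns assume labels like those in the sample header from the original question.

      ```perl
      use strict;
      use warnings;

      # Step 2: scrape one page and return the parsed fields as a hash reference.
      # parse_page and its field names are hypothetical, for illustration only.
      sub parse_page {
          my ($page) = @_;
          my %field;
          ($field{dept})      = $page =~ m{DEPT:\s+(\S+)};
          ($field{post_date}) = $page =~ m{POST\s+DATE:\s+(\d\d/\d\d/\d\d\d\d)};
          return \%field;
      }

      # Made-up report text: two pages separated by a FORM FEED.
      my $report
          = "REPORT HEADER ISCDAYRECAP-001\nDEPT: GROCERY  POST DATE: 07/14/2011\n"
          . "\fREPORT HEADER ISCDAYRECAP-002\nDEPT: PRODUCE  POST DATE: 07/14/2011\n";

      # Step 1: carve the report into pages, one read per page.
      open my $fh, '<', \$report or die "Can not open sample: $!\n";
      my @parsed;
      {
          local $/ = "\f";
          while (my $page = <$fh>) {
              chomp $page;    # removes the trailing "\f", if any
              push @parsed, parse_page($page);
          }
      }
      close $fh;

      print "$_->{dept} $_->{post_date}\n" for @parsed;
      ```

      Keeping the page splitting separate from the field scraping means a layout change in the report only touches parse_page.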

      Jim
